Experience sharing to 0.4

InfiniteWing

2017-08-15

Hello, this notebook demonstrates my solution and shares some experience. Although my method can't beat sh1ng's baseline, it should still be helpful to someone.

(I ensembled my solution with sh1ng's baseline to reach 0.439 on the private LB.)
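I won't detail the exact blend here, but a common way to ensemble two models on this task is to average their per-(order, product) probabilities before running F1-optimization. A minimal sketch (the function name and column layout are my own, not from either kernel):

```python
import pandas as pd

def blend_probs(df_a, df_b, w=0.5):
    """Average per-(order, product) reorder probabilities from two models.

    Both frames are assumed to have columns order_id, product_id, pred.
    The blended scores can then go through the usual F1-optimization step.
    """
    m = df_a.merge(df_b, on=['order_id', 'product_id'], suffixes=('_a', '_b'))
    m['pred'] = w * m['pred_a'] + (1 - w) * m['pred_b']
    return m[['order_id', 'product_id', 'pred']]
```

A 50/50 weight is just a starting point; the weight is worth tuning on local CV.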

Overview

Each part of the work contributed a different gain; here is a quick overview:

  1. 鯤's base XGBoost model, 0.380
  2. Faron's F1-optimization, gain ~0.015
  3. Add aisle and department features, gain ~0.005
  4. A single None model, gain ~0.0007

Data overview

This work has already been done by many awesome kernel/discussion contributors. You can take a look at:

  1. Exploratory Analysis - Instacart, by Philipp Spachtholz
  2. Simple Exploration Notebook - Instacart, by SRK

The relationships between the input files (by Jordan Tremoureux, from this post):

data relationships

Public kernels

  1. light GBM benchmark 0.3692, by paulantoine
  2. Instacart XGBoost Starter - LB 0.3791, by Fabienvs
  3. LB 0.3805009, Python Edition, by 鯤

You should read and try to understand these public kernels. I took 鯤's kernel as my base model because it has the best baseline score and is written in Python.

F1-optimization

The F1-optimization concept is discussed in detail in this post. Here are two papers about it:

  1. Optimizing F-measure: A Tale of Two Approaches
  2. Thresholding Classifiers to Maximize F1 Score

There are also public kernels that already implement F1-optimization:

  1. F1-Score Expectation Maximization in O(n²), by Faron
  2. Approximate caclulation of EF1 (need O(N) ), by Kruegger

The idea of F1-optimization is to treat the predicted probabilities as the ground-truth probabilities, then mathematically work out how many items to submit so that the expected F1 score is maximized. (That's my understanding; if I've misunderstood something, please correct me.)

Take a look at the output image of Faron's kernel:

Faron's kernel

It means that when your predicted probabilities are [0.45, 0.35, 0.31, 0.29, 0.27, 0.25, 0.22, 0.20, 0.17, 0.15, 0.10, 0.05, 0.02], you should pick the top 7 items.
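Under the independence assumption, the expected F1 of a top-k cutoff can also be estimated by simple Monte Carlo simulation. This is not Faron's exact dynamic program, just a rough sketch of the same idea:

```python
import random

def expected_f1(probs, k, n_sim=20000, seed=0):
    """Monte Carlo estimate of E[F1] when predicting the top-k items,
    treating each probability as an independent Bernoulli ground truth."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sim):
        truth = [rng.random() < p for p in probs]  # simulate a basket
        tp = sum(truth[:k])       # true positives among the top-k picks
        n_true = sum(truth)       # actual basket size
        if tp == 0 or n_true == 0:
            continue              # F1 contribution is 0 in these cases
        precision = tp / k
        recall = tp / n_true
        total += 2 * precision * recall / (precision + recall)
    return total / n_sim

probs = [0.45, 0.35, 0.31, 0.29, 0.27, 0.25, 0.22,
         0.20, 0.17, 0.15, 0.10, 0.05, 0.02]
best_k = max(range(1, len(probs) + 1), key=lambda k: expected_f1(probs, k))
```

For these probabilities the best cutoff lands around the top 6-7 items, consistent with Faron's output above; his kernel computes the expectation exactly and also decides the 'None' flag.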

Merging F1-optimization into the public kernel

# Continue 鯤's kernel (bst, d_test and X_test come from that kernel)
import pandas as pd
from tqdm import tqdm

X_test.loc[:,'reordered']=(bst.predict(d_test)).astype(float)
preds=X_test['reordered'].values
order_ids=X_test['order_id'].values
product_ids=X_test['product_id'].values

# group the predicted probabilities by order
order2preds={}
for i in range(len(preds)):
    order_id=order_ids[i]
    product_id=product_ids[i]
    pred=preds[i]
    if(order_id not in order2preds):
        order2preds[order_id]={}
    order2preds[order_id][product_id]=pred

final_preds=[]
final_order_ids=list(order2preds.keys())
print("Start F1 optimization")
for order_id in tqdm(order2preds):
    # sort each order's products by predicted probability, descending
    product2preds=sorted(order2preds[order_id].items(), key=lambda x:x[1], reverse=True)
    probabilities=[v[1] for v in product2preds]
    products=[str(v[0]) for v in product2preds]
    # use Faron's F1 optimizer: opt[0] is the best cutoff k,
    # opt[1] is whether to also predict 'None'
    opt=F1Optimizer.maximize_expectation(probabilities)
    best_k=opt[0]
    pred=products[:best_k]
    if(opt[1]):
        pred.append('None')
    final_preds.append(' '.join(pred))

submit=pd.DataFrame({'order_id': final_order_ids, 'products': final_preds})
submit=submit.sort_values('order_id')
submit.to_csv('sub.csv', index=False)

Create more features

I added several aisle and department features. They gave me around a ~0.005 gain, reaching a 0.4 score on the LB.

create_departments_aisles_features.py
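The actual script isn't reproduced here; the sketch below shows the kind of feature I mean, written as a function so it's easy to test. The feature names (`user_aisle_reorder_rate`, `user_dep_reorder_rate`) are illustrative, not necessarily the ones I used:

```python
import pandas as pd

def build_aisle_dep_features(priors, orders, products):
    """Sketch: per-(user, product) aisle/department reorder-rate features
    from the standard Instacart tables (order_products__prior, orders,
    products)."""
    df = (priors
          .merge(orders[['order_id', 'user_id']], on='order_id')
          .merge(products[['product_id', 'aisle_id', 'department_id']],
                 on='product_id'))
    # how often each user reorders within an aisle / a department
    aisle_rate = (df.groupby(['user_id', 'aisle_id'])['reordered']
                    .mean().rename('user_aisle_reorder_rate').reset_index())
    dep_rate = (df.groupby(['user_id', 'department_id'])['reordered']
                  .mean().rename('user_dep_reorder_rate').reset_index())
    return (df[['user_id', 'product_id', 'aisle_id', 'department_id']]
            .drop_duplicates()
            .merge(aisle_rate, on=['user_id', 'aisle_id'])
            .merge(dep_rate, on=['user_id', 'department_id'])
            .drop(columns=['aisle_id', 'department_id']))
```

Saving the result with `to_csv('departments_aisles_features.csv', index=False)` produces a file keyed by (user_id, product_id), ready for the merge shown next.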

After you save the new features to a file, you can simply load them with pandas and merge them with the original features, like so:

new_feature_df = pd.read_csv("departments_aisles_features.csv")
data = data.merge(new_feature_df, on=['user_id', 'product_id'], how='left')
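A quick sanity check of what that left merge does, on toy frames (the column name `dep_reorder_rate` is made up for the example):

```python
import pandas as pd

# toy stand-ins for the real feature tables
data = pd.DataFrame({'user_id': [1, 1, 2],
                     'product_id': [10, 11, 10],
                     'f0': [0.5, 0.2, 0.9]})
new_feature_df = pd.DataFrame({'user_id': [1, 2],
                               'product_id': [10, 10],
                               'dep_reorder_rate': [0.4, 0.6]})
data = data.merge(new_feature_df, on=['user_id', 'product_id'], how='left')
# the left merge keeps all three original rows; the (1, 11) pair has no
# match in new_feature_df, so its new feature comes back as NaN
```

NaNs from unmatched rows are worth filling (or leaving for XGBoost to handle natively) before training.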

Handling None

I used a simple None prediction at first: if a customer's none-order rate was >= 0.25, I appended 'None' to the prediction. That gained me ~0.006 at the beginning. After Faron revealed his F1-optimization code, I found it gained more to use the None probability calculated during F1-optimization. However, on the last day before the competition ended, I built a single None model to handle the None probability. It gave me a ~0.0007 gain on local CV and the LB. Here's how I did it. First, we need to build some features for the None model:
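For reference, the simple rule I started with can be sketched in a few lines (function name is mine):

```python
def append_none(pred_products, user_none_rate, threshold=0.25):
    """Simple None heuristic: if the customer's historical none-order
    rate reaches the threshold, append 'None' to the predicted products."""
    preds = list(pred_products)
    if user_none_rate >= threshold:
        preds.append('None')
    return preds
```

The None model below replaces this fixed threshold with a learned probability per order.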

create_none_features_train.py

from tqdm import tqdm
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import statistics
orders={}
products={}
users={}
products_aisles={}
products_departments={}
orders_isnone={}
orders_dows={}
orders_hours={}
orders_days_since_prior_order={}
train_days_since_prior_order={}
print("Start loading products.csv")
products_df=pd.read_csv('../input/products.csv', encoding = 'UTF-8')
for index, row in tqdm(products_df.iterrows()):
    product_id=str(int(row['product_id']))
    aisles_id=str(int(row['aisle_id']))
    departments_id=str(int(row['department_id']))
    products_aisles[product_id]=aisles_id
    products_departments[product_id]=departments_id
    

print("Start loading order_products__prior.csv")
#read prior orders
fr = open("../input/order_products__prior.csv", 'r', encoding = 'UTF-8')
fr.readline()# skip header
lines=fr.readlines()
for i,line in tqdm(enumerate(lines)):
    datas=line.replace("\n","").split(",")
    order_id=(datas[0])
    product_id=(datas[1])
    reordered=int(datas[3])
    if(order_id not in orders):
        orders[order_id]={}
    orders[order_id][product_id]=reordered

print("Start loading order_products__train.csv")
#read train orders
fr = open("../input/order_products__train.csv", 'r', encoding = 'UTF-8')
fr.readline()# skip header
lines=fr.readlines()
for i,line in tqdm(enumerate(lines)):
    datas=line.replace("\n","").split(",")
    order_id=(datas[0])
    product_id=(datas[1])
    reordered=int(datas[3])
    if(order_id not in orders):
        orders[order_id]={}
    orders[order_id][product_id]=reordered

for order_id in orders:
    reorder_list=list(orders[order_id].values())
    if(sum(reorder_list)==0):
        orders_isnone[order_id]=1
    else:
        orders_isnone[order_id]=0
    
print("Start loading orders.csv")
#read orders
fr = open("../input/orders.csv", 'r', encoding = 'UTF-8')

fr.readline()# skip header
lines=fr.readlines()
for i,line in tqdm(enumerate(lines)):
    datas=line.replace("\n","").split(",")
    order_id=(datas[0])
    user_id=(datas[1])
    eval_set=(datas[2])
    order_number=(datas[3])
    order_dow=int(datas[4])
    order_hours=int(datas[5])
    orders_dows[order_id]=order_dow
    orders_hours[order_id]=order_hours
    if(user_id not in users):
        users[user_id]={}
    if(eval_set=="prior"):
        try:
            # days_since_prior_order is empty for a user's first order
            days_since_prior_order=int(float(datas[6]))
            if(user_id not in orders_days_since_prior_order):
                orders_days_since_prior_order[user_id]=[]
            orders_days_since_prior_order[user_id].append(days_since_prior_order)
        except ValueError:
            pass
        users[user_id][order_number]=order_id
    elif(eval_set=="train"):
        users[user_id]["train"]=order_id
        days_since_prior_order=int(float(datas[6]))
        train_days_since_prior_order[order_id]=days_since_prior_order
    elif(eval_set=="test"):
        users[user_id]["test"]=order_id
    elif(eval_set=="valid"):
        users[user_id]["valid"]=order_id


print("Start creating features")

user_buytime_mean={}
user_buydow_mean={}
user_products={}
user_departments={}
user_aisles={}
user_none_order_rate={}
user_total_products={}
user_order_len={}
user_overall_reorder={}

for user_id in tqdm(users):
    if('train' not in users[user_id]):
        continue
    user_buytime_mean[user_id]=[]
    user_buydow_mean[user_id]=[]
    user_products[user_id]=[]
    user_departments[user_id]=[]
    user_aisles[user_id]=[]
    user_none_order_rate[user_id]=[]
    user_order_len[user_id]=[]
    user_total_products[user_id]=0
    user_overall_reorder[user_id]=[]
    for i,(order_number, order_id) in enumerate(users[user_id].items()):
        if(order_number in ["test","train","valid"]):
            continue
        user_buytime_mean[user_id].append(orders_hours[order_id])
        user_buydow_mean[user_id].append(orders_dows[order_id])
        if(int(order_number)!=1):
            user_none_order_rate[user_id].append(orders_isnone[order_id])
        user_order_len[user_id].append(len(orders[order_id]))
        for product_id in orders[order_id]:
            user_total_products[user_id]+=1
            if(product_id not in user_products[user_id]):
                user_products[user_id].append(product_id)
            
            aisles_id=products_aisles[product_id]
            departments_id=products_departments[product_id]
            
            if(departments_id not in user_departments[user_id]):
                user_departments[user_id].append(departments_id)
            if(aisles_id not in user_aisles[user_id]):
                user_aisles[user_id].append(aisles_id)
            user_overall_reorder[user_id].append(orders[order_id][product_id])

outcsv = open("none_train_datas.csv", 'w', encoding = 'UTF-8')


cols=[]
cols.append("user_id")
cols.append("order_id")

cols.append("total_product")
cols.append("total_order")
cols.append("total_distinct_product")
cols.append("totoal_dep")
cols.append("totoal_aisle")

cols.append("order_dow")
cols.append("order_hour")
cols.append("days_since_prior_order")

cols.append("overall_reorder_rate")
cols.append("none_order_rate")

cols.append("mean_order_dow")
cols.append("mean_order_hour")
cols.append("mean_days_since_prior_order")
cols.append("mean_basket_size")

cols.append("is_none")

outcsv.write(','.join(cols)+"\n")

print("Start saving features to csv")
for user_id in tqdm(users):
    if('train' not in users[user_id]):
        continue
    
    order_id=users[user_id]["train"]
    
    features=[]
    features.append(user_id)
    features.append(order_id)
    
    features.append(user_total_products[user_id])
    features.append(len(users[user_id])-1)
    features.append(len(user_products[user_id]))
    features.append(len(user_departments[user_id]))
    features.append(len(user_aisles[user_id]))
    
    features.append(orders_dows[order_id])
    features.append(orders_hours[order_id])
    features.append(train_days_since_prior_order[order_id])
    
    overall_reorder_rate=sum(user_overall_reorder[user_id])/len(user_overall_reorder[user_id])
    none_order_rate=sum(user_none_order_rate[user_id])/len(user_none_order_rate[user_id])
    mean_order_dow=sum(user_buydow_mean[user_id])/len(user_buydow_mean[user_id])
    mean_order_hour=sum(user_buytime_mean[user_id])/len(user_buytime_mean[user_id])
    mean_days_since_prior_order=sum(orders_days_since_prior_order[user_id])/len(orders_days_since_prior_order[user_id])
    mean_basket_size=sum(user_order_len[user_id])/len(user_order_len[user_id])
    is_none=orders_isnone[order_id]
    
    features.append(overall_reorder_rate)
    features.append(none_order_rate)
    features.append(mean_order_dow)
    features.append(mean_order_hour)
    features.append(mean_days_since_prior_order)
    features.append(mean_basket_size)
    features.append(is_none)
    
    features_str=[]
    for feature in features:
        if(isinstance(feature, float)):
            features_str.append(str(round(feature,5)))
        else:
            features_str.append(str(feature))
            
    
    outcsv.write(','.join(features_str)+"\n")

outcsv.close()

create_none_features_test.py

from tqdm import tqdm
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import statistics
orders={}
products={}
users={}
products_aisles={}
products_departments={}
orders_isnone={}
orders_dows={}
orders_hours={}
orders_days_since_prior_order={}
train_days_since_prior_order={}
print("Start loading products.csv")
products_df=pd.read_csv('../input/products.csv', encoding = 'UTF-8')
for index, row in tqdm(products_df.iterrows()):
    product_id=str(int(row['product_id']))
    aisles_id=str(int(row['aisle_id']))
    departments_id=str(int(row['department_id']))
    products_aisles[product_id]=aisles_id
    products_departments[product_id]=departments_id
    

print("Start loading order_products__prior.csv")
#read prior orders
fr = open("../input/order_products__prior.csv", 'r', encoding = 'UTF-8')
fr.readline()# skip header
lines=fr.readlines()
for i,line in tqdm(enumerate(lines)):
    datas=line.replace("\n","").split(",")
    order_id=(datas[0])
    product_id=(datas[1])
    reordered=int(datas[3])
    if(order_id not in orders):
        orders[order_id]={}
    orders[order_id][product_id]=reordered

print("Start loading order_products__train.csv")
#read train orders
fr = open("../input/order_products__train.csv", 'r', encoding = 'UTF-8')
fr.readline()# skip header
lines=fr.readlines()
for i,line in tqdm(enumerate(lines)):
    datas=line.replace("\n","").split(",")
    order_id=(datas[0])
    product_id=(datas[1])
    reordered=int(datas[3])
    if(order_id not in orders):
        orders[order_id]={}
    orders[order_id][product_id]=reordered

for order_id in orders:
    reorder_list=list(orders[order_id].values())
    if(sum(reorder_list)==0):
        orders_isnone[order_id]=1
    else:
        orders_isnone[order_id]=0
    
print("Start loading orders.csv")
#read orders
fr = open("../input/orders.csv", 'r', encoding = 'UTF-8')

fr.readline()# skip header
lines=fr.readlines()
for i,line in tqdm(enumerate(lines)):
    datas=line.replace("\n","").split(",")
    order_id=(datas[0])
    user_id=(datas[1])
    eval_set=(datas[2])
    order_number=(datas[3])
    order_dow=int(datas[4])
    order_hours=int(datas[5])
    orders_dows[order_id]=order_dow
    orders_hours[order_id]=order_hours
    if(user_id not in users):
        users[user_id]={}
    if(eval_set=="prior"):
        try:
            # days_since_prior_order is empty for a user's first order
            days_since_prior_order=int(float(datas[6]))
            if(user_id not in orders_days_since_prior_order):
                orders_days_since_prior_order[user_id]=[]
            orders_days_since_prior_order[user_id].append(days_since_prior_order)
        except ValueError:
            pass
        users[user_id][order_number]=order_id
    elif(eval_set=="train"):
        users[user_id]["train"]=order_id
    elif(eval_set=="test"):
        users[user_id]["test"]=order_id
        days_since_prior_order=int(float(datas[6]))
        train_days_since_prior_order[order_id]=days_since_prior_order
    elif(eval_set=="valid"):
        users[user_id]["valid"]=order_id


print("Start creating features")

user_buytime_mean={}
user_buydow_mean={}
user_products={}
user_departments={}
user_aisles={}
user_none_order_rate={}
user_total_products={}
user_order_len={}
user_overall_reorder={}

for user_id in tqdm(users):
    if('test' not in users[user_id]):
        continue
    user_buytime_mean[user_id]=[]
    user_buydow_mean[user_id]=[]
    user_products[user_id]=[]
    user_departments[user_id]=[]
    user_aisles[user_id]=[]
    user_none_order_rate[user_id]=[]
    user_order_len[user_id]=[]
    user_total_products[user_id]=0
    user_overall_reorder[user_id]=[]
    for i,(order_number, order_id) in enumerate(users[user_id].items()):
        if(order_number in ["test","train","valid"]):
            continue
        user_buytime_mean[user_id].append(orders_hours[order_id])
        user_buydow_mean[user_id].append(orders_dows[order_id])
        if(int(order_number)!=1):
            user_none_order_rate[user_id].append(orders_isnone[order_id])
        user_order_len[user_id].append(len(orders[order_id]))
        for product_id in orders[order_id]:
            user_total_products[user_id]+=1
            if(product_id not in user_products[user_id]):
                user_products[user_id].append(product_id)
            
            aisles_id=products_aisles[product_id]
            departments_id=products_departments[product_id]
            
            if(departments_id not in user_departments[user_id]):
                user_departments[user_id].append(departments_id)
            if(aisles_id not in user_aisles[user_id]):
                user_aisles[user_id].append(aisles_id)
            user_overall_reorder[user_id].append(orders[order_id][product_id])

outcsv = open("none_test_datas.csv", 'w', encoding = 'UTF-8')


cols=[]
cols.append("user_id")
cols.append("order_id")

cols.append("total_product")
cols.append("total_order")
cols.append("total_distinct_product")
cols.append("totoal_dep")
cols.append("totoal_aisle")

cols.append("order_dow")
cols.append("order_hour")
cols.append("days_since_prior_order")

cols.append("overall_reorder_rate")
cols.append("none_order_rate")

cols.append("mean_order_dow")
cols.append("mean_order_hour")
cols.append("mean_days_since_prior_order")
cols.append("mean_basket_size")

#cols.append("is_none")

outcsv.write(','.join(cols)+"\n")

print("Start saving features to csv")
for user_id in tqdm(users):
    if('test' not in users[user_id]):
        continue
    
    order_id=users[user_id]["test"]
    
    features=[]
    features.append(user_id)
    features.append(order_id)
    
    features.append(user_total_products[user_id])
    features.append(len(users[user_id])-1)
    features.append(len(user_products[user_id]))
    features.append(len(user_departments[user_id]))
    features.append(len(user_aisles[user_id]))
    
    features.append(orders_dows[order_id])
    features.append(orders_hours[order_id])
    features.append(train_days_since_prior_order[order_id])
    
    overall_reorder_rate=sum(user_overall_reorder[user_id])/len(user_overall_reorder[user_id])
    none_order_rate=sum(user_none_order_rate[user_id])/len(user_none_order_rate[user_id])
    mean_order_dow=sum(user_buydow_mean[user_id])/len(user_buydow_mean[user_id])
    mean_order_hour=sum(user_buytime_mean[user_id])/len(user_buytime_mean[user_id])
    mean_days_since_prior_order=sum(orders_days_since_prior_order[user_id])/len(orders_days_since_prior_order[user_id])
    mean_basket_size=sum(user_order_len[user_id])/len(user_order_len[user_id])
    #is_none=orders_isnone[order_id]
    
    features.append(overall_reorder_rate)
    features.append(none_order_rate)
    features.append(mean_order_dow)
    features.append(mean_order_hour)
    features.append(mean_days_since_prior_order)
    features.append(mean_basket_size)
    #features.append(is_none)
    
    features_str=[]
    for feature in features:
        if(isinstance(feature, float)):
            features_str.append(str(round(feature,5)))
        else:
            features_str.append(str(feature))
            
    
    outcsv.write(','.join(features_str)+"\n")

outcsv.close()

XGBoost model for None

import numpy as np 
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# read datasets
train = pd.read_csv('none_train_datas.csv')
test=pd.read_csv('none_test_datas.csv')
order_id=test['order_id']

y_train = train["is_none"]
y_mean = np.mean(y_train)
       
print('Shape train: {}\nShape test: {}'.format(train.shape, test.shape))

import xgboost as xgb

xgb_params = {
    'eta': 0.005,
    'max_depth': 6,
    'subsample': 0.8,
    # note: 'reg:linear' treats the 0/1 target as a regression problem;
    # 'binary:logistic' would be the more typical choice here
    'objective': 'reg:linear',
    'eval_metric': 'logloss',
    'base_score': y_mean,
    'silent': 1
}

dtrain = xgb.DMatrix(train.drop(['user_id','order_id','is_none'], axis=1), y_train)
dtest = xgb.DMatrix(test.drop(['user_id','order_id'], axis=1))
'''
cv_result = xgb.cv(xgb_params, 
                   dtrain, 
                   nfold=10,
                   num_boost_round=1500, # increase to have better results (~700)
                   early_stopping_rounds=50,
                   verbose_eval=50, 
                   show_stdv=False
                  )
'''
#num_boost_rounds = len(cv_result)
#print(num_boost_rounds)
num_boost_rounds=800
model = xgb.train(dict(xgb_params, silent=1), dtrain, num_boost_round=num_boost_rounds)
preds=model.predict(dtest)

out = pd.DataFrame({'order_id': order_id, 'none_pred': preds})
out.to_csv('test_none_pred.csv', index=False)
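These None probabilities then replace the None decision inside the F1-optimization loop. One thing to watch: with the `reg:linear` objective the raw predictions can fall outside [0, 1], so they should be clipped before use. A small helper (name is mine):

```python
import pandas as pd

def load_none_probs(path='test_none_pred.csv'):
    """Load the None model's predictions into an order_id -> probability
    map, clipping reg:linear output into [0, 1]."""
    df = pd.read_csv(path)
    return dict(zip(df['order_id'], df['none_pred'].clip(0.0, 1.0)))
```

With this map in hand, each order's None probability can be passed to the F1 optimizer (or compared against a tuned threshold) instead of letting it be derived from the item probabilities alone.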

Further reading

Many people have shared their solutions on the forum; you can learn from them:

  1. 3rd-Place Solution Overview, by sjv
  2. 4-th Place Tips, by GeorgeGui
  3. 9th place Approach, by KazAnova
  4. SQL feature engineering + XGBoost (.4026 private LB, 1/2 of #100 solution), by happycube
  5. Top-30 Silver Hints, by Fred Navruzov
  6. 6th place solution overview, by Akulov Yaroslav
  7. #11 Solution, by zr
  8. 12th solution, by plantsgo
  9. 2nd Place Solution, by ONODERA

Thanks for reading; I'll be happy if this kernel helps someone. See you in the next competition!