Simple way to do the validation

InfiniteWing

2017-07-30

Hello, this notebook demonstrates a simple way to do local validation. There is a similar post, Validation demo, which is also worth a look.

Why validation

This is a common question. The most important reason is to avoid overfitting. Local validation also lets you test your model and tune its parameters without using up submissions, since there's a submission limit on Kaggle.

Create validation files

Before this step, I suggest you take a look at some EDA notebooks. They will help you understand the competition and the data structure more quickly.

In my validation approach, I change the label of some 'train' orders to 'valid' in orders.csv. It is an intuitive approach, and it makes it easy to run a baseline kernel when you do validation.

In [4]:
import random

random.seed(3228)
nfold=2

#read orders
with open("../input/orders.csv", 'r') as fr:
    fr.readline()# skip header
    lines=fr.readlines()
valid_order_ids=[[] for i in range(nfold)]
print("Total {} lines in orders.csv".format(len(lines)))

fold_csvs=[]
for fold in range(nfold):
    outcsv = open("orders_valid_{}.csv".format(fold), 'w')
    outcsv.writelines("order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order"+"\n")
    fold_csvs.append(outcsv)

for i,line in enumerate(lines):
    datas=line.replace("\n","").split(",")
    order_id=datas[0]
    eval_set=datas[2]
    if(eval_set=='train'):
        randNum=random.randint(1,1000)
        for j in range(nfold):
            # copy the row so a 'valid' label set for one fold does not leak into later folds
            row=list(datas)
            if(randNum<=(j+1)*1000/(nfold) and randNum>(j)*1000/(nfold)):
                valid_order_ids[j].append(order_id)
                row[2]='valid'
            fold_csvs[j].write(','.join(row)+"\n")
    else:
        outline=','.join(datas)
        for j in range(nfold):
            fold_csvs[j].write(outline+"\n")

# close the fold files so every row is flushed to disk
for outcsv in fold_csvs:
    outcsv.close()
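To use a fold file, a baseline kernel only has to treat 'valid' rows the way it would otherwise treat 'test' rows. Here is a minimal sketch of that selection step, assuming the column layout written above. The helper name order_ids_by_eval_set and the sample rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict

def order_ids_by_eval_set(csv_file):
    """Group order ids by the value of the eval_set column."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[row["eval_set"]].append(row["order_id"])
    return dict(groups)

# Hypothetical rows in the same layout as orders_valid_0.csv
sample = io.StringIO(
    "order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order\n"
    "1,10,prior,1,2,8,\n"
    "2,10,train,2,3,9,7.0\n"
    "3,11,valid,5,1,14,30.0\n"
    "4,12,test,3,4,10,12.0\n"
)
groups = order_ids_by_eval_set(sample)
print(groups["valid"])  # the orders to predict on and score locally
```

With the real files, you would pass an opened fold file (for example open("orders_valid_0.csv")) instead of the in-memory sample.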

Create true labels for each validation file

To calculate a score, you need the ground-truth labels. We only need to read 'order_products__train.csv', because every 'valid' order was originally a 'train' order.

In [5]:
orders={}
#read train orders
with open("../input/order_products__train.csv", 'r') as fr:
    fr.readline()# skip header
    lines=fr.readlines()
for i,line in enumerate(lines):
    datas=line.replace("\n","").split(",")
    order_id=datas[0]
    product_id=datas[1]
    reordered=int(datas[3])
    if(order_id not in orders):
        orders[order_id]=[]
    # only reordered products count as true labels
    if(reordered==1):
        orders[order_id].append(product_id)

for fold in range(nfold):
    # the with block closes the file so every row is flushed to disk
    with open("orders_valid_label_{}.csv".format(fold), 'w') as outcsv:
        outcsv.write("order_id,label"+"\n")
        for order_id in valid_order_ids[fold]:
            # an order with no reordered products gets the explicit label 'None'
            if(len(orders[order_id])==0):
                orders[order_id].append('None')
            datas=[order_id,' '.join(orders[order_id])]
            outcsv.write(','.join(datas)+"\n")
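If you score in a separate kernel, the label files can be read back into a dictionary. This is a rough sketch, assuming the 'order_id,label' layout written above with product ids separated by spaces. The helper name read_labels and the sample rows are hypothetical.

```python
import csv
import io

def read_labels(csv_file):
    """Map each order_id to its list of true product ids (['None'] for an empty order)."""
    labels = {}
    for row in csv.DictReader(csv_file):
        labels[row["order_id"]] = row["label"].split(" ")
    return labels

# Hypothetical rows in the same layout as orders_valid_label_0.csv
sample = io.StringIO(
    "order_id,label\n"
    "2,24852 13176\n"
    "3,None\n"
)
true_labels = read_labels(sample)
print(true_labels)
```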

Calculate F1 score

You can use the following code to calculate the F1 score of your validation predictions. It is easy to plug into baseline kernels: all you need to do is predict on 'valid' orders rather than 'test' orders. The score is the mean of the per-order F1 scores. As a demo, I predict Banana (product_id = 24852) for every order.

In [7]:
from sklearn.metrics import f1_score        
for fold in range(nfold):
    f1_scores=[]
    for order_id in valid_order_ids[fold]:
        y_pred_labels=['24852'] # Predict Banana for every order; replace this with your own prediction
        y_true_labels=orders[order_id]
        labels=list(set(y_pred_labels)|set(y_true_labels))
        
        y_pred=[]
        y_true=[]
        for label in labels:
            if(label in y_true_labels):
                y_true.append(1)
            else:
                y_true.append(0)
            if(label in y_pred_labels):
                y_pred.append(1)
            else:
                y_pred.append(0)
        score=f1_score(y_true, y_pred)
        f1_scores.append(score)
    print("F1 Score = {} for all Banana's prediction".format(sum(f1_scores)/len(f1_scores)))
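The same per-order F1 can also be computed with plain set arithmetic, which avoids building the 0/1 vectors. The sketch below is an alternative under stated assumptions, not the code used above: the function names order_f1 and mean_f1 are made up, the sample orders are invented, and an order with no prediction is assumed to predict 'None'.

```python
def order_f1(y_pred_labels, y_true_labels):
    """F1 between the predicted and true product sets of a single order."""
    pred = set(y_pred_labels)
    true = set(y_true_labels)
    tp = len(pred & true)  # products that are both predicted and true
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(true)
    return 2 * precision * recall / (precision + recall)

def mean_f1(predictions, truths):
    """Average the per-order F1 over every order in truths."""
    scores = []
    for order_id, y_true_labels in truths.items():
        # Assumption: a missing prediction is treated as predicting 'None'
        y_pred_labels = predictions.get(order_id, ["None"])
        scores.append(order_f1(y_pred_labels, y_true_labels))
    return sum(scores) / len(scores)

# Invented validation orders; every order is predicted as Banana (24852)
truths = {"2": ["24852", "13176"], "3": ["None"], "4": ["21137"]}
predictions = {order_id: ["24852"] for order_id in truths}
print("Mean F1 = {}".format(mean_f1(predictions, truths)))
```

In this example only order 2 contains Banana, so it scores 2/3 and the other two orders score 0.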

Thanks for reading. If you find a mistake or I have misunderstood something, feedback is appreciated.