Project: Predict a user’s next order

In this project you are given a dataset that describes customers' orders over time in an e-shop. The goal of the competition is to predict whether a user will buy products from a specific department in their next order. For each user, we provide between 4 and 100 of their orders, with the sequence of products purchased in each order. We also provide the day of the week and hour of day each order was placed, and a relative measure of time between orders.

Specifically, we want to predict whether a user will buy at least one product from department 7 ("beverages") in their next order.

You may form teams of two or work individually.

Data description

The data can be downloaded from https://www.dropbox.com/s/4492bcfmvaxfghn/data.zip?dl=0. A brief description of each data file follows.

orders.csv: This file contains information on all orders of all users. Some orders are used for training, some for testing, and the rest form the prior order history.

order_id: order identifier
user_id: customer identifier
order_number: the order sequence number for this user (1 = first, n = nth)
order_dow: the day of the week the order was placed on
order_hour_of_day: the hour of the day the order was placed on
days_since_prior_order: days since the user's previous order, capped at 30 (NA for order_number = 1)

products.csv:

product_id: product identifier
product_name: name of the product
aisle_id: foreign key (The aisle where the product can be found)
department_id: foreign key (The department where the product belongs). We are interested in the department with id 7 in this challenge

aisles.csv:

aisle_id: aisle identifier
aisle: the name of the aisle

departments.csv:

department_id: department identifier
department: the name of the department   

order_products__prior.csv: This file contains the users' order history. All orders in this file took place before the ones given in the train and test files. One of the goals of this project is to extract features from this file that improve your predictions.

order_id: foreign key
product_id: foreign key
add_to_cart_order: order in which each product was added to cart
reordered: 1 if this product has been ordered by this user in the past, 0 otherwise
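
As an illustration of the kind of feature that can be extracted from the order history, the following is a minimal sketch, not a prescribed method. It assumes orders.csv, order_products__prior.csv and products.csv have been loaded into pandas DataFrames with the columns described above. The function name beverage_history_features and the output column names n_prior_orders and beverage_order_rate are hypothetical.

```python
import pandas as pd

BEVERAGES_DEPARTMENT_ID = 7  # department of interest in this challenge


def beverage_history_features(orders, order_products_prior, products):
    """Per-user share of prior orders that contained at least one beverage.

    Hypothetical helper: all three arguments are DataFrames with the
    columns described in the data description.
    """
    # Attach department_id to every purchased product line.
    lines = order_products_prior.merge(
        products[["product_id", "department_id"]], on="product_id", how="left"
    )
    lines["is_beverage"] = lines["department_id"] == BEVERAGES_DEPARTMENT_ID

    # One row per prior order: did it contain any beverage?
    per_order = lines.groupby("order_id")["is_beverage"].any().reset_index()

    # Attach user_id, then aggregate the order-level flags per user.
    per_order = per_order.merge(
        orders[["order_id", "user_id"]], on="order_id", how="left"
    )
    features = (
        per_order.groupby("user_id")["is_beverage"]
        .agg(n_prior_orders="count", beverage_order_rate="mean")
        .reset_index()
    )
    return features
```

The resulting table has one row per user and can be joined to X_train and X_test on user_id before fitting a model. Users without any rows in the prior file would get missing values after such a join and would need a default.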


X_train.csv: This file contains the orders that form the training set. It is the subset of orders.csv that corresponds to the training data, in the same format as orders.csv.

y_train.csv: This file contains the true labels of the orders in the training set. If an order contains at least one product from department 7 (beverages), it belongs to category 1; otherwise it belongs to category 0. For example:

    order_id,category
    2620548,0
    1707550,1

X_test.csv: This file contains the orders that will be used for testing. It is the subset of orders.csv that corresponds to the test data, in the same format as orders.csv.

Example

In [22]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
import numpy as np


#read train test data
X_train=pd.read_csv("data/X_train.csv")
y_train=pd.read_csv("data/y_train.csv")
X_test=pd.read_csv("data/X_test.csv")
print(X_train.head(2))
   order_id  user_id  order_number  order_dow  order_hour_of_day  \
0   3110915    94104            58          3                 22   
1   2277131    18463             4          6                 21   

   days_since_prior_order  
0                     2.0  
1                    24.0  
In [21]:
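# Hold out 33% of the labelled orders to estimate accuracy locally
X_example_train, X_example_test, y_example_train, y_example_test = train_test_split(X_train, y_train, test_size=0.33)
# Baseline 1: random guessing (0 or 1 with equal probability)
print("random", accuracy_score(y_example_test["category"], np.random.randint(2, size=len(y_example_test))))
# Baseline 2: logistic regression on the raw columns of X_train
# (this includes the order_id and user_id columns as features)
logreg = LogisticRegression()
logreg.fit(X_example_train, y_example_train["category"])
y_pred = logreg.predict(X_example_test)
print("logreg", accuracy_score(y_example_test["category"], y_pred))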
random 0.501000769823
logreg 0.516859122402
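
Random guessing lands near 0.5 by construction, so it can be useful to also compare against always predicting the most frequent label. The sketch below is one way to compute that reference point. The helper name majority_class_accuracy is hypothetical, and it assumes the labels are passed as pandas Series such as y_example_train["category"] and y_example_test["category"].

```python
import pandas as pd


def majority_class_accuracy(y_fit, y_eval):
    """Accuracy obtained by always predicting the most frequent label of y_fit.

    Hypothetical helper: y_fit and y_eval are pandas Series of 0/1 labels.
    """
    # Most frequent label in the data used for fitting.
    majority_label = y_fit.mode().iloc[0]
    # Share of evaluation labels that equal this constant prediction.
    return float((y_eval == majority_label).mean())
```

A model whose held-out accuracy is not clearly above this value has learned little beyond the class balance.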

Submission instructions

Predictions are evaluated using accuracy. You should create a file called y_test that contains your model's predictions for the orders in the X_test file. The format of the y_test file must be exactly the same as that of the y_train file. For example:

order_id,category
2620548,0
1707550,1

Using the classifier trained above, we can now generate predictions for the test set:

In [23]:
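print(X_test.head(2))
# Predict on the original feature columns and leave X_test unchanged,
# so that this cell can be re-run safely
submission = X_test[['order_id']].copy()
submission['category'] = logreg.predict(X_test)
submission.to_csv("sample_submission.csv", index=False)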
   order_id  user_id  order_number  order_dow  order_hour_of_day  \
0   2401431    24698             6          1                 16   
1    198450   152822            41          4                  9   

   days_since_prior_order  
0                    15.0  
1                    13.0  
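
Before uploading, it may help to check the submission against the required format. The sketch below is only an assumption about what a well-formed file looks like, derived from the y_train format described above. The helper name check_submission is hypothetical, and the requirement that every order in X_test appears exactly once is an assumption rather than a documented rule of the platform.

```python
import pandas as pd


def check_submission(submission, x_test):
    """Return a list of format problems; an empty list means none were found.

    Hypothetical helper: both arguments are pandas DataFrames.
    """
    problems = []
    # The header must match the y_train format: order_id,category
    if list(submission.columns) != ["order_id", "category"]:
        problems.append("columns must be exactly order_id,category")
        return problems
    # Assumption: each test order is predicted exactly once.
    if submission["order_id"].duplicated().any():
        problems.append("duplicate order_id values")
    if set(submission["order_id"]) != set(x_test["order_id"]):
        problems.append("order_id values differ from X_test")
    # Labels are binary: 1 if the order contains a beverage, else 0.
    if not submission["category"].isin([0, 1]).all():
        problems.append("category must be 0 or 1")
    return problems
```

For example, check_submission(submission, X_test) can be called right before writing the CSV file, and the file written only when the returned list is empty.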

For the evaluation, upload the submission file to http://195.251.252.9/challengePostgrad. Teams that do not submit a solution will not be graded. The platform will be open for submissions after 7/12/2017.