In this project you are given a dataset that describes customers' orders over time in a specific e-shop. The goal of the competition is to predict whether a user will buy products from a specific department in their next order. For each user, we provide between 4 and 100 of their orders, with the sequence of products purchased in each order. We also provide the day of the week and hour of day the order was placed, and a relative measure of time between orders.
In this specific case we are interested in predicting whether a user will buy at least one product from department 7 ("beverages") in their next order.
You may form teams of two or work individually.
The data can be downloaded here: https://www.dropbox.com/s/4492bcfmvaxfghn/data.zip?dl=0
A brief description of each data file follows.
orders.csv: This file contains the information for all orders of all users. Some orders are used for training, some for testing, and some as prior order history.
order_id: order identifier; user_id: customer identifier; order_number: the order sequence number for this user (1 = first, n = nth); order_dow: the day of the week the order was placed on; order_hour_of_day: the hour of the day the order was placed at; days_since_prior_order: days since the last order, capped at 30 (with NAs for order_number = 1)
product_id: product identifier; product_name: name of the product; aisle_id: foreign key (the aisle where the product can be found); department_id: foreign key (the department to which the product belongs). We are interested in the department with id 7 in this challenge.
aisle_id: aisle identifier; aisle: the name of the aisle
department_id: department identifier; department: the name of the department
order_products__prior.csv: This file contains the history of orders for the users. All orders in this file took place prior to the ones that are given in the train and test files. One of the goals of this project is to extract features from this file in order to make your predictions better.
order_id: foreign key; product_id: foreign key; add_to_cart_order: order in which each product was added to cart; reordered: 1 if this product has been ordered by this user in the past, 0 otherwise
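As an illustration of the kind of feature that can be extracted from the prior history, the following is a minimal sketch that computes, for each user, the fraction of prior orders containing at least one product from department 7. The toy tables, the function name `user_beverage_rate`, and the feature name `beverage_rate` are assumptions made for this example; the column names follow the file descriptions above.

```python
import pandas as pd

BEVERAGES_DEPARTMENT_ID = 7

# Toy stand-ins for the real tables; all values are made up for illustration.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "user_id": [10, 10, 20, 20],
})
order_products_prior = pd.DataFrame({
    "order_id": [1, 1, 2, 3, 4],
    "product_id": [100, 101, 101, 102, 100],
})
products = pd.DataFrame({
    "product_id": [100, 101, 102],
    "department_id": [7, 4, 7],
})


def user_beverage_rate(orders, order_products_prior, products):
    """Fraction of each user's prior orders with a department-7 product."""
    # Attach the department to every purchased product.
    lines = order_products_prior.merge(
        products[["product_id", "department_id"]], on="product_id", how="left"
    )
    lines["is_beverage"] = lines["department_id"] == BEVERAGES_DEPARTMENT_ID
    # One row per order: 1 if any product in the order is a beverage.
    per_order = (
        lines.groupby("order_id")["is_beverage"].any().astype(int)
        .rename("has_beverage").reset_index()
    )
    # Attach the user and average over that user's prior orders.
    per_order = per_order.merge(
        orders[["order_id", "user_id"]], on="order_id", how="left"
    )
    return per_order.groupby("user_id")["has_beverage"].mean().rename("beverage_rate")


rates = user_beverage_rate(orders, order_products_prior, products)
print(rates)
```

The resulting series is indexed by user_id, so it could be joined onto the training and test rows through their user_id column before fitting a model.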
X_train.csv: This file contains the order_ids that will be used as the training dataset. The data in this file are the subset of the orders.csv file that corresponds to the training data. The format is the same as that of orders.csv.
y_train.csv: This file contains the true labels of the order_ids in the training set. If an order contains at least one product from department 7 (beverages), it belongs to category 1; otherwise it belongs to category 0. For example:
order_id,category
2620548,0
1707550,1
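The labelling rule can be sketched as follows. This is only an illustration of the definition, not the procedure actually used to build y_train.csv: the order contents and product table below are made up (only the two order_ids are borrowed from the example), and `label_orders` is a hypothetical helper name.

```python
import pandas as pd

# Made-up order contents for the two example order_ids.
order_products = pd.DataFrame({
    "order_id": [2620548, 2620548, 1707550],
    "product_id": [1, 2, 3],
})
# Made-up product table: only product 3 belongs to department 7.
products = pd.DataFrame({
    "product_id": [1, 2, 3],
    "department_id": [4, 16, 7],
})


def label_orders(order_products, products, department_id=7):
    """Category 1 if an order has any product from department_id, else 0."""
    merged = order_products.merge(products, on="product_id", how="left")
    in_department = merged["department_id"] == department_id
    flags = in_department.groupby(merged["order_id"]).any()
    return flags.astype(int).rename("category").reset_index()


labels = label_orders(order_products, products)
print(labels.to_csv(index=False))
```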
X_test.csv: This file contains the order_ids that will be used for testing. The data in this file are the subset of the orders.csv file that corresponds to the test data. The format is the same as that of orders.csv.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Read the train and test data
X_train = pd.read_csv("data/X_train.csv")
y_train = pd.read_csv("data/y_train.csv")
X_test = pd.read_csv("data/X_test.csv")
print(X_train.head(2))
   order_id  user_id  order_number  order_dow  order_hour_of_day  \
0   3110915    94104            58          3                 22
1   2277131    18463             4          6                 21

   days_since_prior_order
0                     2.0
1                    24.0
# Hold out a third of the training data for local evaluation
X_example_train, X_example_test, y_example_train, y_example_test = train_test_split(
    X_train, y_train, test_size=0.33)

# Random baseline
print("random", accuracy_score(y_example_test["category"],
                               np.random.randint(2, size=len(y_example_test))))

# Logistic regression
logreg = LogisticRegression()
logreg.fit(X_example_train, y_example_train["category"])
y_pred = logreg.predict(X_example_test)
print("logreg", accuracy_score(y_example_test["category"], y_pred))
random 0.501000769823
logreg 0.516859122402
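Uniform random guessing scores about 0.5 whatever the class balance, so a second reference point can be useful: always predicting the most frequent class. The sketch below shows this on toy labels. The class balance of the real y_train.csv is not stated here, so no figure is claimed for it, and `majority_class_accuracy` is a hypothetical helper name.

```python
import pandas as pd


def majority_class_accuracy(y_fit, y_eval):
    """Accuracy of always predicting the most frequent class seen in y_fit."""
    majority = y_fit.value_counts().idxmax()
    return float((y_eval == majority).mean())


# Toy labels, made up for illustration (not taken from y_train.csv).
y_fit_toy = pd.Series([0, 0, 0, 1, 0, 1])
y_eval_toy = pd.Series([0, 1, 0, 0])

baseline = majority_class_accuracy(y_fit_toy, y_eval_toy)
print("majority", baseline)
```

With the variables from the split above, the call would be `majority_class_accuracy(y_example_train["category"], y_example_test["category"])`.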
The metric used to evaluate your predictions is accuracy. You should create a file called y_test that contains your model's predictions for the orders in the X_test file. The format of the y_test file should be exactly the same as that of the y_train file. For example:
order_id,category
2620548,0
1707550,1
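Before uploading, it may help to check a submission locally. The sketch below is one possible sanity check plus a hand-written accuracy function. The specific checks are assumptions about what a well-formed file looks like given the format above, not the platform's actual validation, and both function names are hypothetical.

```python
import pandas as pd


def check_submission(submission, X_test):
    """Return a list of format problems (empty if none were found)."""
    if list(submission.columns) != ["order_id", "category"]:
        return ["columns must be exactly order_id,category"]
    problems = []
    if submission["order_id"].duplicated().any():
        problems.append("duplicate order_id values")
    if set(submission["order_id"]) != set(X_test["order_id"]):
        problems.append("order_id values do not match X_test")
    if not submission["category"].isin([0, 1]).all():
        problems.append("category must be 0 or 1")
    return problems


def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true labels."""
    y_true, y_pred = list(y_true), list(y_pred)
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data, made up for illustration.
X_test_toy = pd.DataFrame({"order_id": [2401431, 198450]})
good = pd.DataFrame({"order_id": [2401431, 198450], "category": [0, 1]})
bad = pd.DataFrame({"order_id": [2401431, 2401431], "category": [0, 2]})

print(check_submission(good, X_test_toy))
print(check_submission(bad, X_test_toy))
print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))
```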
Given the classifier we trained before, we can now generate predictions for the test set:
print(X_test.head(2))
X_test['category'] = logreg.predict(X_test)
submission = X_test[['order_id', 'category']]
submission.to_csv("sample_submission.csv", index=False)
   order_id  user_id  order_number  order_dow  order_hour_of_day  \
0   2401431    24698             6          1                 16
1    198450   152822            41          4                  9

   days_since_prior_order
0                    15.0
1                    13.0
For the evaluation, upload the submission file to http://18.104.22.168/challengePostgrad. Teams that do not submit a solution will not be graded. The platform will be open for submissions after 7/12/2017.