Data Mining Cup 2022

The goal of the project was to predict user-based replenishment of a product based on historical orders and item features. For a predefined subset of user and product combinations, we had to predict if and when a product would be purchased during the prediction period. The data covered the period from 01.06.2020 to 31.01.2021, and the prediction period ran from 01.02.2021 to 28.02.2021. The prediction column had to be filled as follows:

0 – no replenishment during that period
1 – replenishment in the first week
2 – replenishment in the second week
3 – replenishment in the third week
4 – replenishment in the fourth week
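
The label scheme above can be sketched as a small helper that maps a purchase date to the target value. This is a minimal illustration, not the competition's official definition: it assumes that week 1 covers 01.02.2021 to 07.02.2021 and that each following week is the next block of seven days. The names PERIOD_START, PERIOD_END and week_label are hypothetical.

```python
from datetime import date
from typing import Optional

# Prediction period as stated in the task description.
PERIOD_START = date(2021, 2, 1)
PERIOD_END = date(2021, 2, 28)


def week_label(purchase_date: Optional[date]) -> int:
    """Return 0 if there is no purchase in the period, else the week number 1-4.

    Assumption: weeks are consecutive 7-day blocks starting on PERIOD_START.
    """
    if purchase_date is None:
        return 0
    if not (PERIOD_START <= purchase_date <= PERIOD_END):
        return 0
    return (purchase_date - PERIOD_START).days // 7 + 1


# Example: a purchase on 10.02.2021 falls into the second week.
print(week_label(date(2021, 2, 10)))  # prints 2
```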

We approached this by doing the following:

Check the given data and engineer some features
Clean the data
Transform the data
Compare different models
Check whether the chosen model makes reasonable predictions and adjust accordingly
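
The engineered features are not listed in this write-up. Since the later adjustment step relies on the mean and the standard deviation of the time between purchases, one plausible feature pair can be sketched as follows. The function name interval_features and the choice of the sample standard deviation are assumptions made for illustration.

```python
from datetime import date
from statistics import mean, stdev


def interval_features(order_dates):
    """Return (mean, standard deviation) of the days between consecutive orders.

    Assumptions: order_dates holds the purchase dates of one user/product
    combination; the sample standard deviation is used; with fewer than two
    gaps the deviation is set to 0.0, and with no gap both values are None.
    """
    ordered = sorted(order_dates)
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    if not gaps:
        return None, None
    std_gap = stdev(gaps) if len(gaps) > 1 else 0.0
    return mean(gaps), std_gap


# Illustrative dates only, not taken from the competition data.
example_orders = [date(2020, 10, 1), date(2020, 10, 22),
                  date(2020, 11, 14), date(2020, 12, 3)]
mean_gap, std_gap = interval_features(example_orders)
print(mean_gap, std_gap)
```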

Tools and libraries we used during the project:

Python including Pandas, NumPy, Matplotlib, Scikit-learn and XGBoost
KNIME

Models we tested:

XGBoost (regression & classification)
Linear Regression
Random Forest

Our approach:
After testing different models, we decided to go with linear regression. When we inspected the model's predictions, we realized that some predictions that should have landed in the prediction period did not. This happened, for example, when a customer purchased the product regularly and the mean time between purchases was short, e.g., 3 weeks. We then calculated the delta between the last date on which a customer could have made a purchase (31.01.2021) and the customer's actual last purchase. If the delta was smaller than ½ times the standard deviation plus the mean time between purchases, we added the mean time between purchases on top of the model's prediction.
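
The adjustment rule can be sketched roughly like this. Two points are assumptions, since the description leaves them open: the threshold is read as (½ × standard deviation) + mean, and the model's prediction is taken to be expressed in days. The names CUTOFF and adjust_prediction are hypothetical.

```python
from datetime import date

# Last day on which a purchase could appear in the given data.
CUTOFF = date(2021, 1, 31)


def adjust_prediction(pred_days, last_purchase, mean_gap, std_gap):
    """Shift a prediction by the mean purchase interval for recent buyers.

    Assumptions: pred_days, mean_gap and std_gap are all measured in days,
    and the threshold is 0.5 * std_gap + mean_gap.
    """
    delta = (CUTOFF - last_purchase).days
    threshold = 0.5 * std_gap + mean_gap
    if delta < threshold:
        return pred_days + mean_gap
    return pred_days


# A customer who last bought 11 days before the cutoff, with a mean interval
# of 21 days and a standard deviation of 2 days, gets the prediction shifted.
print(adjust_prediction(5.0, date(2021, 1, 20), 21, 2.0))  # prints 26.0
```

Customers whose last purchase lies further back than the threshold keep the unmodified prediction.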

Summary:
In the end, we finished in 16th place out of 78 participants. It was a great learning experience: we learned a lot about different data science techniques, and that solving real-world problems is a lot harder than it seems at first sight.

What I've learned

Technical constraints can be circumvented through intelligent data transformation (or more compute power ;-))
It's beneficial to test different methodologies early in the project
Always check results for plausibility
How to use machine learning libraries like Scikit-learn
