212 Data Scientists in 10 Cities Around the World Participated in this Hackful Event. 24 Hours of Non-Stop, Fun Data Science Competition. The World's First Data Science Hackathon.

Winner, Data Science Global Hackathon: Ben Hamner (Kaggle employee, ineligible for prizes)

My Solution to the EMC Data Science Global Hackathon

To train and recreate the winning submission (results may differ slightly, as the random number generator did not have a fixed seed):

  1. Download TrainingData.csv from https://www.kaggle.com/c/dsg-hackathon/data and put it in this folder
  2. Run make_predictions.m from the MATLAB command prompt
  3. Copy the resulting predictions from predictions.csv to the appropriate spreadsheet in SubmissionConversion.xls
  4. Save the submission worksheet as a new CSV file
  5. Get the MATLAB code from GitHub: https://github.com/benhamner/Air-Quality-Prediction-Hackathon-Winning-Model

1st Prize Winner Data Science Global: James Petterson 

My Solution to the EMC Data Science Global Hackathon

Abstract

This document describes my solution to the EMC Data Science Global Hackathon, a 24-hour competition aimed at building an early warning system to predict levels of air pollutants.

1 Introduction

My solution consisted of 390 Generalised Boosted Regression (GBR) models, one for each combination of test position and response variable, blended with the results of one of the benchmarks provided by the competition organisers.

This article is organized as follows: Section 2 explains the features used as an input to the GBR models, the parameters of which are described in Section 3. Section 4 explains how inference was done, and Section 5 describes the final ensemble and the results.

2 Features

The organisers provided data in chunks, each one corresponding to a time slice of 11 days. In the training data each chunk was composed of up to 192 positions¹, corresponding to 8 days. For each position we had several measurements² and 39 response variables. The task was to predict the response variables for 10 other test positions (1 hour, 2 hours, etc.) following each 8-day period.³

¹ One position corresponds to one hour.
² Measurement here means all the provided variables except responses.

Three sets of features were used for training: Chunk means, Chunk recent and Global means.

2.1 Chunk means, Chunk recent

These were computed for each chunk and test position with the following procedure:

• get all the N training rows for the chunk (N is usually 192, but for some chunks there were fewer rows)

• keep only the first (ordered by position) N − k rows, where k is the test position (i.e., 1, 2, …, 48, 72); call this set of rows D

• chunk means is the mean of all measurements and response variables in D for the hour corresponding to the test position

• chunk recent is the last⁴ reading of all measurements and response variables in D

So, for example, for test position 1 we are using almost all the available training data, while for test position 72 we are using only the first 5 days. This makes sense, as in the latter case the algorithm should learn to predict using “older” data, since we are predicting far into the future.
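
As a rough illustration only (the winning pipeline was written in R, and column names such as position_within_chunk and hour, as well as the exact alignment of the target hour, are assumptions here), the construction of these two feature sets for one chunk and one test position might look like:

import pandas as pd

def chunk_features(chunk: pd.DataFrame, k: int, value_cols):
    """Illustrative 'chunk means' and 'chunk recent' features for one chunk
    and one test position k. Column names are assumptions, and the target
    hour alignment is a simplification."""
    # Keep only the first N - k rows, ordered by position within the chunk.
    D = chunk.sort_values("position_within_chunk").iloc[: len(chunk) - k]

    # Hour of day at which the prediction will be made (assumed 0-23 integers).
    target_hour = (D["hour"].iloc[-1] + k) % 24

    # Chunk means: mean of all measurements and responses in D at that hour.
    chunk_means = D.loc[D["hour"] == target_hour, value_cols].mean()

    # Chunk recent: the reading of the last hour in D (may contain missing values).
    chunk_recent = D[value_cols].iloc[-1]

    return pd.concat([chunk_means.add_prefix("mean_"),
                      chunk_recent.add_prefix("recent_")])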

2.2 Global means

This is the mean of all response variables for the hour corresponding to the test position, as in the provided benchmarks.

2.3 Bootstrapping

To increase the amount of training data, the process described in Section 2.1 was repeated 24 times, each time removing one additional row from each chunk of training data. To exemplify, Table 1 shows the training rows used when building features for one chunk.
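
A minimal sketch of the row ranges this produces (an illustration of the pattern shown in Table 1 below, not the author's code; it assumes the trailing row is the one removed on each repeat):

def bootstrap_row_ranges(n_rows=192,
                         test_positions=(1, 2, 3, 4, 5, 10, 17, 24, 48, 72),
                         n_repeats=24):
    """Yield (repeat, test position k, last row used): on repeat r, r - 1
    trailing rows have been removed before keeping the first N - k rows."""
    for repeat in range(1, n_repeats + 1):
        available = n_rows - (repeat - 1)
        for k in test_positions:
            yield repeat, k, available - k

# repeat 1, k = 1 -> rows 1..191; repeat 24, k = 72 -> rows 1..97 (cf. Table 1)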

³ Details here.
⁴ The reading of the last hour, not the last available reading, so some variables may not be available.

Table 1: Given a chunk with N = 192, these are the rows utilised for training for each test position, for each one of the 24 repeats of the bootstrapping procedure.

repeat #   test position (k)   rows
1          1                   1 … 191
1          2                   1 … 190
…          …                   …
1          48                  1 … 144
1          72                  1 … 120
2          1                   1 … 190
2          2                   1 … 189
…          …                   …
2          48                  1 … 143
2          72                  1 … 119
…          …                   …
24         48                  1 … 121
24         72                  1 … 97

3 Training

For each one of the 390 combinations of test position and response variable, a GBR [1] model was trained on the concatenation of the features described above, using the R gbm package of [2]. The Laplace distribution was chosen, as the goal was to minimise absolute error, and the number of trees was fixed at 20,000. All other parameters were kept at their default values.
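
The models were trained with the R gbm package; purely as an illustration, an analogous model in Python's scikit-learn (not the author's code; the placeholder data and scikit-learn's defaults for the unstated parameters are assumptions) could be set up as follows:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Stand-in for one of the 390 (test position, response variable) models.
# gbm's Laplace distribution corresponds to minimising absolute error,
# i.e. loss="absolute_error" in scikit-learn (called "lad" in older versions).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 20))   # placeholder for chunk means, chunk
y_train = rng.normal(size=200)         # recent and global means features

model = GradientBoostingRegressor(loss="absolute_error", n_estimators=20000)
model.fit(X_train, y_train)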

4 Inference

Inference was done separately for each response variable. Given a chunk, all features were computed as described in Section 2, except that this time all the data was used (i.e., k = 0). The correct model (based on test position and response variable) was then applied.
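
Schematically (a hypothetical sketch: models is assumed to be a dictionary keyed by (test position, response variable), and compute_features stands in for the procedure of Section 2 applied with k = 0):

import numpy as np

def predict_chunk(chunk, models, test_positions, response_vars, compute_features):
    """For each test position and response variable, build the features from
    all available rows of the chunk (k = 0) and apply the matching model."""
    predictions = {}
    for k in test_positions:
        x = np.asarray(compute_features(chunk, rows_dropped=0, test_position=k)).reshape(1, -1)
        for var in response_vars:
            predictions[(k, var)] = models[(k, var)].predict(x)[0]
    return predictions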

5 Results

The method described in the previous sections achieved a public score of 0.19574, and a private score of 0.21192.

A linear blend with one of the provided benchmarks (average per hour and chunk) was also attempted (95% of GBR and 5% of the benchmark). This resulted in a small improvement, with a public score of 0.19515 and a private score of 0.21143.

Both submissions were selected for the final standings.

References

[1] J. H. Friedman. Greedy Function Approximation: A Gradient Boosting Machine. Annals of Statistics, 29(5):1189–1232, 2001.

[2] G. Ridgeway. Generalized Boosted Models: A guide to the gbm package.

1st Prize Winner Data Science London: The Londoners (Dan Harvey, Ferenc Huszar, Jedidiah Francis, Jose Miguel Fernandez Lobato)

Our Solution to the EMC Data Science Global Hackathon 

(To see the complete set of files and documents of our solution click here)

1. Estimate missing values and binarize categorical variables.

For this we do

$ cd dataPreprocessing
$ R --no-save < preprocessData.R
$ cd ..

This script loads the training data from “raw_data/TrainingData.csv” and replaces the categorical variables (day of week, month, hour) by corresponding binary indicators. The script also estimates the missing values in the data using the variational matrix factorization method of

Raiko, T.; Ilin, A. & Karhunen, J. Principal Component Analysis for Large Scale Problems with Lots of Missing Values. European Conference on Machine Learning (ECML 2007), Springer Berlin / Heidelberg, 2007, LNCS 4701, 691–698.

The imputed dataset is also standardized so that each non-categorical variable has zero mean and unit standard deviation.

The input to this script is the file

“raw_data/TrainingData.csv” -> Contains the original training data.

The outputs of this script are the files

“dataPreprocessing/imputedDataSet.txt” -> Contains the imputed dataset.
“dataPreprocessing/meanMarginals.txt” -> Contains the marginal means of each column from “dataPreprocessing/imputedDataSet.txt”.
“dataPreprocessing/sdMarginals.txt” -> Contains the marginal standard deviation of each column from “dataPreprocessing/imputedDataSet.txt”.
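
Purely as an illustration of the binarization and standardization steps described above (the actual work is done by the R script, and the variational imputation of Raiko et al. is omitted here; column names are assumptions):

import pandas as pd

raw = pd.read_csv("raw_data/TrainingData.csv")

# Replace the categorical variables by binary indicator columns.
categorical = ["weekday", "month", "hour"]          # assumed column names
data = pd.get_dummies(raw, columns=categorical)

# Standardize the remaining (non-indicator) columns to zero mean and unit
# standard deviation; the means and sds would be stored alongside the data,
# analogous to meanMarginals.txt and sdMarginals.txt.
indicator_cols = [c for c in data.columns
                  if any(c.startswith(p + "_") for p in categorical)]
numeric_cols = [c for c in data.columns if c not in indicator_cols]
means, sds = data[numeric_cols].mean(), data[numeric_cols].std()
data[numeric_cols] = (data[numeric_cols] - means) / sds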

2. Reduce the dimensionality of the time-series measurements.

For this we do

$ cd svd
$ R --no-save < doSVD.R
$ cd ..

This script loads the imputed training data from “dataPreprocessing/imputedDataSet.txt” and reduces the dimensionality of the last 89 columns to just 40 by means of the variational matrix factorization method described by:

Nakajima, S.; Sugiyama, M. & Tomioka, R. Global Analytic Solution for Variational Bayesian Matrix Factorization. Advances in Neural Information Processing Systems, 2010.

This script generates the file “svd/factors.txt” with the 40 factors that summarize the 89 previous columns and the file “svd/imputedDataSetWithFactorLoadings.txt” which contains the original imputed dataset, but with the last 89 columns replaced by the values of the 40 factor loading variables.

The input to this script is the file

“dataPreprocessing/imputedDataSet.txt” -> Contains the imputed dataset.

The outputs of this script are the files

“svd/factors.txt” -> Contains the 40 factors.
“svd/imputedDataSetWithFactorLoadings.txt” -> Contains the imputed dataset with the 40 factor loading variables.
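
As an illustration only, a plain truncated SVD in Python as a stand-in for the variational Bayesian matrix factorization of Nakajima et al. (the file format and column layout are assumptions, and this is not the doSVD.R script):

import numpy as np
from sklearn.decomposition import TruncatedSVD

imputed = np.loadtxt("dataPreprocessing/imputedDataSet.txt")  # assumed whitespace-separated

# Reduce the last 89 (measurement) columns to 40 factor loading variables.
measurements = imputed[:, -89:]
svd = TruncatedSVD(n_components=40, random_state=0)
loadings = svd.fit_transform(measurements)      # n_rows x 40, cf. the factor loadings
factors = svd.components_                       # 40 x 89,    cf. svd/factors.txt

# Original imputed data with the 89 columns replaced by the 40 loadings,
# cf. svd/imputedDataSetWithFactorLoadings.txt.
reduced = np.hstack([imputed[:, :-89], loadings])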

3. Run Python scripts.

median_hour_month.py
For two chunks there was no training data provided, so for these two chunks we couldn't use the method we developed to regress future values against previous observations. These predictions had to be made based only on the hour and month at which the prediction was to be made.

To handle these two chunks we computed the global median value of each target variable in each month at certain hours of the day. To make the estimates more accurate we always included data from a month before and after each month. So if we wanted to make a prediction at 10am in October, our output was the median of the target variables at 10 o'clock in September, October and November.

We used the median, and not the mean, as we knew the evaluation was going to be based on mean absolute error: the median minimises the absolute error, whereas the empirical mean would minimise the mean squared error.

The script median_hour_month.py processes the dataset and computes these median values for any month-hour pair.
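
A condensed Python sketch of what median_hour_month.py computes (column and variable names here are assumptions, and this is not the original script):

import pandas as pd

def month_hour_medians(df, target_cols):
    """Median of each target variable per (month, hour), pooling each month
    with its neighbouring months to make the estimates more stable."""
    medians = {}
    for month in range(1, 13):
        # e.g. October pools September, October and November (wrapping at year end)
        window = [(month - 2) % 12 + 1, month, month % 12 + 1]
        pooled = df[df["month"].isin(window)]
        for hour in range(24):
            rows = pooled[pooled["hour"] == hour]
            medians[(month, hour)] = rows[target_cols].median()
    return medians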

4. Binarize the test file.

For this we do

$ cd testSetPreprocessing
$ R --no-save < preprocessData.R
$ cd ..

This script binarizes the categorical variables in the test file in the same way as we did for the training file in step 1. For this, the script opens the file “python/training/submission_training.csv” and generates the file “testSetPreprocessing/binarizedTestDataSet.txt”.

The input to this script is the file

“python/training/submission_training.csv” -> Contains the test set preprocessed by the python scripts so that it contains the factor loading variables for each instance.

The output of this script is the file

“testSetPreprocessing/binarizedTestDataSet.txt” -> The same file as the one above but with the categorical variables binarized.

5. Generate a final prediction for the test set.

For this we do

$ cd fitLinearModel
$ R --no-save < generateFinalPrediction.R
$ cd ..

This script fits a linear model that predicts the current values of the factor loading variables given the previous values of those variables and the current categorical variables (day of the week, month, hour). These linear models are used to generate the final predictions for the test set.

The inputs to this script are the files

“imputedDataWithFactorLoadings/factors.txt” -> Contains the 40 factors that summarize the 89 measurement variables.
“testSetPreprocessing/binarizedTestDataSet.txt” -> Contains the binarized test set.
“imputedDataWithFactorLoadings/meanMarginals.txt” -> Contains the mean of each column of the imputed training set.
“imputedDataWithFactorLoadings/sdMarginals.txt” -> Contains the standard deviation of each column of the imputed training set.
“python/training/training_*_ubertrain.csv” -> Contains the training data for the prediction task * hours in advance, where * is 1, 2, 3, 4, 5, 10, 17, 24, 48, and 72.

The output of this script is the file

“fitLinearModel/finalResult.txt” -> Contains the final submission file.
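
As a hedged sketch of the idea behind generateFinalPrediction.R (one linear model per look-ahead horizon, regressing the current factor loadings on earlier loadings plus the binarized calendar indicators; the shapes and random placeholder data below are illustrative, not the real inputs):

import numpy as np
from sklearn.linear_model import LinearRegression

horizons = [1, 2, 3, 4, 5, 10, 17, 24, 48, 72]   # hours ahead, as in training_*_ubertrain.csv
models = {}

for h in horizons:
    # X_h: previous factor loadings plus binarized day-of-week/month/hour indicators.
    # y_h: the 40 factor loadings h hours later. Random placeholders stand in here.
    rng = np.random.default_rng(h)
    X_h = rng.normal(size=(500, 40 + 43))
    y_h = rng.normal(size=(500, 40))
    models[h] = LinearRegression().fit(X_h, y_h)  # multi-output linear regression

# At prediction time the predicted loadings are mapped back through the factors
# and the standardization is undone using the stored means and standard deviations.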

6. Blending with the benchmark hourly average.

Towards the end of the competition we reached the point where we thought the predictive power of our simple linear model could not be improved any further, and we were too exhausted to try anything more sophisticated. As our predictions were still only marginally better than the provided baseline and sample submission 'submission_hour_chunk_means.csv', we thought we might improve our predictive power by forming an ensemble of our linear model and the provided nonparametric benchmark solution.

blending.py computes a convex combination of our predictions with the supplied sample submission file 'submission_hour_chunk_means.csv' that we downloaded from Kaggle.

We first tried a 95%-5% blend of our predictions with the benchmark predictions. That resulted in a public leaderboard score of 0.22631, thus indeed improving on using our method alone: our solution without blending produced a score of 0.22896, and the benchmark solution achieved a score of 0.27532.

We fitted a parabola to these three observations, (0, 0.22896), (0.05, 0.22631) and (1, 0.27532), and predicted that if the loss function were quadratic (which we knew it wasn't, but for a large number of test predictions it can be reasonably close), an optimal ensemble would blend our method with the benchmark at about a 33%-67% ratio. To avoid overfitting to the part of the test set used for the public leaderboard scores, we decided to try a more conservative 25%-75% mixture, which was our final, and best, submission with a public score of 0.22075.

The supplied image BlendingInterpolation.pdf shows how well our last submission fits the predicted parabola.
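
For concreteness, a small Python sketch of the interpolation argument (fitting a quadratic through the three (benchmark weight, score) observations and reading off its vertex); this illustrates the reasoning rather than reproducing blending.py:

import numpy as np

# (weight given to the benchmark, public leaderboard score)
weights = np.array([0.00, 0.05, 1.00])
scores = np.array([0.22896, 0.22631, 0.27532])

# Exact quadratic fit through the three points: score ~ a*w**2 + b*w + c.
a, b, c = np.polyfit(weights, scores, deg=2)

# Vertex of the parabola: the estimated score-minimising benchmark weight.
w_opt = -b / (2 * a)
print(f"estimated optimal benchmark weight: {w_opt:.2f}")

# The team read this off as roughly a one-third benchmark share and then
# submitted a more conservative 25% benchmark / 75% linear-model blend.
def blend(our_preds, benchmark_preds, w=0.25):
    """Convex combination: weight w on the benchmark, 1 - w on our predictions."""
    return (1 - w) * np.asarray(our_preds) + w * np.asarray(benchmark_preds)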