212 Data Scientists in 10 Cities Around the World Participated in this Hackful Event. 24 Hours of Non-Stop, Fun Data Science Competition. The World's First Data Science Hackathon.

(interviews as originally posted in Kaggle blog)

EMC DATA SCIENCE GLOBAL COMPETITION: INTERVIEW WITH THE WINNERS

Winner: Ben Hamner. Ineligible for prizes (Ben is a Kaggle employee)

What was your background prior to entering this challenge? I graduated from Duke University in 2010 with a bachelor’s degree in biomedical engineering, electrical and computer engineering, and mathematics. For the next year, I applied machine learning to improve non-invasive brain-computer interfaces as a Whitaker Fellow at EPFL. On the side, I participated in or won a number of machine learning competitions. Since November 2011, I have designed and structured a variety of competitions as a Kaggle data scientist.

What made you decide to enter? I was hanging out at Splunk (one of the SF venues hosting the hackathon). Anthony asked me some questions about extracting features from the data, which prompted me to open it up and look at it in the afternoon.

What preprocessing and supervised learning methods did you use? I took the lagging N components from the full time series (N=8 for the winning submission, which was selected arbitrarily) as features, then each of the 10 prediction times and 39 pollutant measures as targets. I then trained 390 Random Forests over the entire training data, one for each predicted offset time-pollutant combination. The Random Forest parameters were selected so that the models would be quick to train. The code for creating the winning model is available here.
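The feature and target layout described above can be sketched roughly as follows. This is an illustration in Python, not the winning MATLAB code: the function name `build_lagged_dataset`, the toy series, and the offsets used are all assumptions made for the example.

```python
from typing import Dict, List, Sequence, Tuple


def build_lagged_dataset(
    series: Sequence[Sequence[float]],
    n_lags: int,
    offsets: Sequence[int],
) -> Tuple[List[List[float]], Dict[Tuple[int, int], List[float]]]:
    """Build lag features and one target list per (offset, variable) pair.

    `series` holds one row of measurements per time step (hypothetical layout).
    """
    n_vars = len(series[0])
    max_offset = max(offsets)
    features: List[List[float]] = []
    targets: Dict[Tuple[int, int], List[float]] = {
        (h, v): [] for h in offsets for v in range(n_vars)
    }
    # t is the index of the most recent row inside the lag window.
    for t in range(n_lags - 1, len(series) - max_offset):
        window = series[t - n_lags + 1 : t + 1]
        # Flatten the last n_lags rows into a single feature vector.
        features.append([x for row in window for x in row])
        # Each (offset, variable) pair is a separate regression target.
        for h in offsets:
            for v in range(n_vars):
                targets[(h, v)].append(series[t + h][v])
    return features, targets


# Toy data: 6 time steps, 2 variables (made up for the example).
toy = [[float(i), 10.0 * i] for i in range(6)]
X, Y = build_lagged_dataset(toy, n_lags=2, offsets=(1, 2))
```

With 10 offsets and 39 variables, `targets` would have 390 keys, and one model would be trained per key on the shared feature matrix. The interview names Random Forests for that step; any regressor could be slotted in here.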

Some straightforward approaches to improving this model include:

  • Optimizing the parameters for model performance as opposed to training time
  • Directly optimizing for the error metric (mean absolute error) instead of RMSE
  • Using a data-driven approach to select the number of previous time points to include
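As a rough illustration of the last two points, the sketch below picks the number of lagged time points by scoring each candidate with mean absolute error on a held-out tail of the series. It is a minimal example under stated assumptions: the forecaster is a deliberately trivial stand-in (the mean of the last N values), not a Random Forest, and the names `select_n_lags` and `window_mean_forecast` are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between targets and predictions."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


def window_mean_forecast(history: Sequence[float], n_lags: int) -> float:
    """Stand-in model: predict the next value as the mean of the last n_lags values."""
    window = history[-n_lags:]
    return sum(window) / len(window)


def select_n_lags(
    series: Sequence[float], candidates: Sequence[int], holdout: int
) -> Tuple[int, Dict[int, float]]:
    """Return the candidate lag count with the lowest MAE on the last `holdout` points."""
    start = len(series) - holdout
    scores: Dict[int, float] = {}
    for n in candidates:
        # One-step-ahead forecasts using only data before each held-out point.
        preds = [window_mean_forecast(series[:t], n) for t in range(start, len(series))]
        scores[n] = mean_absolute_error(series[start:], preds)
    best = min(scores, key=scores.get)
    return best, scores


# Toy upward-trending series (made up for the example).
series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
best_n, mae_by_n = select_n_lags(series, candidates=(1, 2, 4), holdout=3)
```

On this toy series the shortest window wins because the data trend steadily upward. On real data the same loop would wrap whatever model is actually being trained, scored with the competition metric.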

What was your most important insight into the data? I don’t believe I had any specific insights on the data – I barely looked at it before training the model.

Were you surprised by any of your insights? I was surprised that domain insight wasn’t necessary to win the hackathon. Key insights have been crucial in many of our longer-running competitions.

Which tools did you use? Once I decided to fiddle with the data, I asked David (a fellow Kaggle data scientist) to pick a random number between one and three. He picked two, and I used MATLAB. (If he had said one, I would have used R; three would have been Python.)

What have you taken away from this competition? Taking all the features and chucking them into a Random Forest works surprisingly well on a variety of real-world problems. This is demonstrated more empirically in this paper. I’m very interested in domains such as CV and NLP where this doesn’t hold true, or where the problem can’t be simply formulated in the standard supervised machine learning framework.

What did you think of the 24 hour hackathon format? It was a lot of fun! I especially enjoyed seeing Kagglers in venues all over the world collaborating and competing on this problem. I’m curious to see how much better the results would be if we ran this as a standard competition over a couple months, and whether the work in the first day would comprise the majority of the improvement over the benchmark.

1st Prize Winner EMC Data Science Global: James Petterson

What was your background prior to entering this challenge? I am currently finishing my PhD in machine learning at ANU. Before that I worked as a software engineer for the telecom industry for many years.

What made you decide to enter? The challenge of Kaggle competitions always attracted me – I took part in two other ones in the past (What Do You Know and Heritage Health Prize). I was abstaining from entering new ones as I know how time consuming this can be, but when I heard about this 24h one I couldn’t resist.

What preprocessing and supervised learning methods did you use? I computed a set of training instances based on:
- mean of all variables for each prediction time
- mean of all variables for each prediction time and chunkID
- most recent value of all variables for each chunkID
I did some bootstrapping to increase the size and variety of the training data, using a 24-hour moving window. I then trained 390 Generalised Boosted Regression models, one for each combination of target variable and prediction time.
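One possible reading of the per-chunk summaries and the moving window is sketched below in Python (the interview says the actual work was done in R). The record layout `(chunk_id, hour, values)`, the function names, and the way windows are cut are all assumptions for illustration. The per-prediction-time means and the boosted regression models themselves are not shown.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple

Record = Tuple[int, int, Sequence[float]]  # (chunk_id, hour, values), hypothetical


def chunk_features(rows: Iterable[Record]) -> Dict[int, List[float]]:
    """Per chunk: the mean of every variable followed by its most recent value.

    Rows are assumed to be sorted by hour within each chunk.
    """
    by_chunk: Dict[int, List[Sequence[float]]] = defaultdict(list)
    for chunk_id, _hour, values in rows:
        by_chunk[chunk_id].append(values)
    features: Dict[int, List[float]] = {}
    for chunk_id, history in by_chunk.items():
        n_vars = len(history[0])
        means = [sum(r[v] for r in history) / len(history) for v in range(n_vars)]
        latest = list(history[-1])
        features[chunk_id] = means + latest
    return features


def moving_windows(history: Sequence, window: int, step: int = 1) -> List[Sequence]:
    """Cut overlapping sub-histories so one chunk yields several training instances."""
    return [history[s : s + window] for s in range(0, len(history) - window + 1, step)]


# Toy records: chunk 1 has two hours of data, chunk 2 has one (made up).
rows = [(1, 0, [1.0, 10.0]), (1, 1, [3.0, 30.0]), (2, 0, [5.0, 50.0])]
feats = chunk_features(rows)
windows = moving_windows(list(range(5)), window=3)
```

With hourly data, `window=24` would correspond to the 24-hour window mentioned above. Each window can then be summarised with `chunk_features`-style statistics to enlarge the training set.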

What was your most important insight into the data? I didn’t spend much time looking at the data, so I can’t think of any particular insight.

Were you surprised by any of your insights? I was surprised that I had a good result without spending much time trying to understand the data. I suspect that wouldn’t be the case in a longer competition, though.

Which tools did you use? Only R.

What have you taken away from this competition? I saw once again how powerful boosting methods are. Even though this was essentially a time series problem, a standard boosting regression method performed quite well.

What did you think of the 24-hour hackathon format? Normally competitions take 3 months or more, which tends to favour those that can spend more time on them. The 24-hour format was great in the sense that it gave a chance to those that are more time constrained. And, of course, it was a lot of fun!

I hope we will have more of these in the future.

1st Prize Winner EMC Data Science London: The Londoners (Dan Harvey, Ferenc Huszar, Jedidiah Francis, Jose Miguel Fernandez Lobato)
Interview coming soon