
Wednesday, March 5, 2014

Top Ten Reasons To "Kaggle"



Do you aspire to do Machine Learning, Data Science, or Big Data Analytics?  If so, you have probably studied, taken courses, read a bunch of blog postings, and can code up some R, Python or Matlab.

Are you ready to start solving real world problems?  Probably not.  It is one thing to know some things about data; it is a very different thing altogether to solve real world problems effectively.  So how do you improve your skills?

I highly recommend you take a look at Kaggle Competitions.  Kaggle.com hosts Data Science/Machine Learning competitions on its site.  They offer a wide range of challenging problems with a fixed deadline and the element of competition.  There is also usually a modest financial incentive for the winner(s), although I am surprised at how meager most of the prizes are considering the amount of work involved and the benefit the sponsors derive from crowdsourcing their problems.  But the real benefit is not in the prize money; it is in the learning process.

Kaggle has been a tremendous learning experience to expand my depth and breadth of knowledge.  Here's why:

1. Kaggle exposes you to a wide range of Machine Learning problems: Forecasting, Sentiment Analysis, Natural Language Processing, Image Recognition, etc.  This motivates you to learn as much as you can about the problem domain, the type of data involved, and the various algorithms which might be applicable.

2. Kaggle competitions run under a time limit.  This "forces" you to work very efficiently, developing and testing alternative ideas quickly.  When you are under pressure and motivated to score highly in a competition, you will focus and learn more techniques in a very short time frame.

3. Kaggle competitions "force" you to code and recode your solution in the most resource efficient manner possible, making tradeoffs between programmer time, CPU time, RAM, etc.  In order to compete, you need to discover and remove performance bottlenecks quickly.  This enables you to improve turnaround time for subsequent iterations.

4.  Each competition uses a different scoring mechanism.  You will learn about the various scoring metrics and when they are used.  You will probably code some of these yourself.

5. You will surely learn the value of Cross-validation: re-sampling and retraining your model multiple times to validate that your solution is working and not overfitting the data.

6. You will learn new methods for dealing with dirty data:  Cleaning, filtering, handling missing values, etc.  Sometimes the competition planners intentionally throw garbage into the data sets in order to make the challenge harder.

7. You will sometimes be handling massive files, which challenges you to slice, sample, split, extract and zip useful subsets of the data.

8.  Each competition has a forum where competitors help each other tackle the problem.  There is a really supportive atmosphere for learning and exploring in the Kaggle forums.  At the conclusion of the competition, there is a massive learning opportunity as the participants "open their kimonos" and share their best work for solving the problem.  The more intimate your knowledge of the problem, the better you will understand the thought process they went through, and the more useful your notes will be for the next competition.

9. You will be competing against some of the best Data Scientists in the world.  This competition brings out the best you have in yourself.  If you are mediocre in your approach, it will show in your results.  Your Kaggle Leaderboard ranking is immediate feedback on how well you have broken down and solved the problem.  You can't lie to yourself: the final leaderboard shows where you stand.

10. You will come to realize there is more to machine learning than just pushing data through a library algorithm.  If all Kaggle competitors have access to the same libraries of algorithms and tools, what differentiates the solutions?   How do you win?   You can do your best work and still find 200-300 people with higher scores on the Kaggle leaderboard.   The leaderboard scoring focuses all your energy on the primary objective:  Improving the overall score of your solution.  It can be tough.  Kaggle competitors are some of the most brilliant minds on the planet.

11.  After you have scored highly in a number of competitions (a top ten finish and a top 10% placement) you can earn the coveted "Kaggle Master" badge.

12.  Recruiters are scouring the Kaggle boards looking for talented Data Scientists.  You could find a new position. 
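
Reasons 4 and 5 are easy to try at home.  Below is a minimal sketch in plain Python of one common scoring metric (log loss) and a hand-rolled k-fold cross-validation loop.  Everything in it is an illustrative assumption on my part: the function names, the toy data, and the "base rate" model are all made up for the example, and it is not tied to any particular competition.

```python
import math
import random


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood for binary labels (lower is better)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip so we never take log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def k_fold_indices(n, k, seed=0):
    """Shuffle range(n) and deal the indices into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(xs, ys, fit, predict, k=5, seed=0):
    """Score each fold with a model trained on the other k-1 folds."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        probs = [predict(model, xs[i]) for i in held_out]
        scores.append(log_loss([ys[i] for i in held_out], probs))
    return scores


# Hypothetical toy "model": always predict the positive rate seen in training.
def fit_base_rate(xs, ys):
    return sum(ys) / len(ys)


def predict_base_rate(model, x):
    return model


if __name__ == "__main__":
    xs = list(range(20))
    ys = [1 if x % 4 == 0 else 0 for x in xs]  # made-up labels
    scores = cross_validate(xs, ys, fit_base_rate, predict_base_rate, k=5)
    print("per-fold log loss:", scores)
    print("mean log loss:", sum(scores) / len(scores))
```

If the per-fold scores vary wildly, or the mean is much worse than what you see on your training data, that is a hint the model may be overfitting.  In practice you would likely swap the toy model for a real one and use the metric the competition actually specifies.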


For all of these reasons, Kaggle works to bring out the best talent within you.  If you really want to become an expert in Data Science and Machine Learning, you should consider Kaggle competitions.

(Yes, I know that was actually twelve reasons, but Ten makes a better headline  :-)


Here is a screen shot of the Loan Default Prediction Leaderboard