Netflix Prize and Kaggle

The Beginning

Data mining, predictive analytics, and machine learning have benefited greatly from the popularization of competitions. The first, or at least the most notable, early analytics competition is known as the “Netflix Prize”: https://www.netflixprize.com/

Netflix had the groundbreaking idea to freely distribute an anonymized copy of their consumer viewing data to “crowdsource” a better model to predict what movies/shows a customer would want to view next. The dataset included 100,480,507 ratings given by consumers to 17,770 movies. Each record consisted of a user, movie, date, and star rating (1 to 5). Netflix created a simple website to outline the rules, allow team registration, serve up the dataset, and display a leaderboard.

Netflix’s own existing algorithm (called Cinematch) was able to predict star ratings with a root mean squared error (RMSE) score of 1.054. In other words, Cinematch predicted the star rating you would give to each movie within about one star on average. So, if you felt a movie deserved 3 stars out of 5, Cinematch predicted that you would give it somewhere between 2 and 4 stars. Not perfect, but not terrible either.
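To make the RMSE metric concrete, here is a minimal sketch of how it is computed. The ratings below are illustrative examples, not from the Netflix dataset:

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length lists of ratings."""
    squared_error = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared_error / len(actual))

# Hypothetical true star ratings vs. a model's predictions
actual = [3, 5, 2, 4, 1]
predicted = [3.5, 4.0, 2.5, 4.5, 2.0]

print(round(rmse(actual, predicted), 3))  # average miss of under one star
```

An RMSE of 1.054 means the model’s predictions miss the true ratings by roughly one star on average, with larger misses penalized more heavily because the errors are squared before averaging.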

Netflix offered $1,000,000 to the first team that could improve on that RMSE score by 10%. Because it was expected that this improvement might take years to achieve, if it could be achieved at all, annual “progress” prizes of $50,000 were offered for the best prediction to date, as long as it improved at least 1% over the prior year’s best.

The competition began on October 2, 2006. Six days later, a team called WXYZConsulting had already beaten Cinematch’s results. One week later, three teams had beaten Cinematch—the product refined over years by Netflix’s entire team of data scientists. After progress prizes were issued in 2007 and 2008, a 30-day “last call” was issued on June 26, 2009, for final submissions because the 10% threshold had been reached. The final winning team had achieved a 10.09% improvement.

The Netflix Prize experiment was a huge success. Why? Consider the number of data scientists Netflix needed to create and maintain Cinematch; estimates put it somewhere around 30 to 50. Suppose these data scientists earn an average salary of $130,000, which means the company pays about $190,000 per employee once taxes and benefits are included. That is a cost of $9,500,000 per year (50 data scientists × $190,000) to get an RMSE of 1.054 out of Cinematch.

Netflix paid $1,000,000 over three years for a 10.09% improvement. Even if their own data scientists could have achieved that same improvement, it would have cost Netflix about $2,850,000 ($9,500,000 × 3 years × 10% improvement). And there was no guarantee that their team could have achieved that improvement at all. In effect, Netflix spent $1,000,000 and got back roughly $2,850,000 in value, nearly a threefold return on their “investment.”

There are several lessons to be learned from this case. First, there was nothing wrong with the data scientists at Netflix. They were simply subject to groupthink, a phenomenon in which groups have a very difficult time recognizing good ideas that do not already exist within the group. Second, crowdsourcing is a great way to accomplish a massive amount of work or progress in a relatively short period of time. And finally, although safeguarding personally identifiable information is critically important, there are many other types of data (e.g., anonymized movie ratings) that can be shared to foster greater learning and good throughout the world. Netflix could have decided it was too risky to let competitors have their precious data. Indeed, that was a risk. However, Netflix recognized that the value of what they could achieve with a crowdsourced competition outweighed the risks of sharing data with competitors.

Inspiring Kaggle.com

The success of the Netflix Prize has inspired analytics competitions all over the world, including competition hubs like https://www.kaggle.com. Kaggle is a website where organizations can post similar analytics competitions with cash prizes. However, Kaggle is much more than a competition hub. It is also an incredible learning resource, because the datasets and complete submissions with code are retained after each competition and made publicly available. Kaggle has built-in code editors and discussion boards that make it easy to learn and practice with real-world datasets.

You can search through their repository to find data on most common topics: https://www.kaggle.com/datasets