In this blog, I am introducing you to Kaggle, a company that incentivizes the best data scientists in the world to examine your data and help solve your problems.
I first learned about “Data Mining Competitions” from my good friend and X PRIZE Foundation Trustee Rob McEwen. Unlike me, Rob owns gold mines (yup, gold mines), and a lot of them. In 1998 he wanted to understand how much gold was in one particular mine, but his top scientists couldn’t tell him from the geological data. So he took all of his secret data (normally kept in the safe) and put it up on the web for the world to see.
Next he put up a $500,000 prize and asked data scientists worldwide to analyze this data and show him where he could find his next 6 million ounces of gold. The data scientists took the bait and the competition was on. Some 1,400 people downloaded the data, and Rob received 125 entries. As it turns out, the top three winners (none of whom, by the way, ever physically traveled to visit his mine) showed him where to locate those 6 million ounces of gold. A $500,000 purse netted him some $3 billion in value in just one year. Now THAT was leverage!
The questions I put to you are these:
Do you have a lot of data and a problem that you’d like to challenge 45,000 data scientists to help you solve?
Do you want to find a way to identify whether a song will be a hit?
Do you want to determine whether photos submitted to your website are any good?
Do you want to advance research in HIV treatment?
Do you want to discover where the universe is hiding its mysterious dark matter?
A machine-learning, data-competition platform called Kaggle can help solve these problems. In fact, Kaggle has actually solved these challenges.
To get all the details, I interviewed Kaggle’s President and Chief Scientist Jeremy Howard. Jeremy is a brilliant data scientist himself. He’s an entrepreneur with a thick Australian accent and a background in philosophy and management consulting who’d built and sold startups. His first affiliation with Kaggle was when he competed successfully in Kaggle’s early contests. So successful was he in these competitions — including one that was trying to replace the 50-year-old ranking system for chess matches — and so enamored was he of the Kaggle platform that after running into founder and CEO Anthony Goldbloom, a fellow Aussie, Jeremy joined the team and moved to San Jose.
I caught up with Jeremy at Singularity University for this interview.
“So what is Kaggle?” I asked him. “Kaggle is a new kind of company, which is creating a whole new way of doing work by leveraging the most powerful tools out there — machine learning and artificial intelligence,” he said. “Kaggle has built a platform that allows you to get access to more than 45,000 data scientists to help you with your problems. Throw away your preconceived ideas and think about what ways you can potentially transform your business by leveraging machine learning. Kaggle is a marketplace. All the best marketplaces bring together two groups that are looking for each other. In Kaggle’s case we’re bringing together people with interesting problems to solve and lots of data to mine, and tens and tens of thousands of data scientists — many the best in the world — who enjoy a challenge, who want to look at your data and figure out what’s hiding in there.”
For those of you who have never heard the term “machine learning,” Jeremy provided a quick description of it as well. “Machine learning is the ability for machines (computers) to come up with ways of solving problems themselves just by looking at some data you give them about some causes and effects,” he explained. “In the long term it’s my strong belief that development of machine learning will eventually lead to strong artificial intelligence.”
“Most of the data scientists who compete are either with major universities or running research departments; these guys aren’t available for you to easily hire.” Jeremy went on to explain that many are scientists who work at companies like Google, Facebook, LinkedIn, Microsoft and Apple during the day. “On evenings and weekends they compete on Kaggle.” Kaggle has run about 100 competitions since its founding in 2010.
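To make Jeremy’s description a little more concrete, here is a minimal sketch of a program “learning” from causes and effects. It is my own illustration, not anything Kaggle or Jeremy provided: the data (hours studied versus test score) is made up, the function names `fit_line` and `predict` are hypothetical, and fitting a straight line is only the simplest possible stand-in for the far richer models competitors actually use.

```python
def fit_line(causes, effects):
    """Learn a slope and intercept from example (cause, effect) pairs
    using ordinary least squares."""
    n = len(causes)
    mean_x = sum(causes) / n
    mean_y = sum(effects) / n
    # Spread of the causes, and how causes and effects vary together
    sxx = sum((x - mean_x) ** 2 for x in causes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(causes, effects))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, cause):
    """Apply the learned line to a cause the program has never seen."""
    slope, intercept = model
    return slope * cause + intercept


# Made-up training data for illustration only
hours = [1.0, 2.0, 3.0, 4.0]
scores = [52.0, 54.0, 56.0, 58.0]

model = fit_line(hours, scores)
print(predict(model, 5.0))  # estimated score after 5 hours of study
```

Nobody told the program the rule connecting hours to scores; it worked the rule out from the examples and then applied it to a new case. That, in miniature, is the idea.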
“The interesting thing is that very few of these people who are winning Kaggle competitions are first and foremost machine-learning researchers or experts in the particular field we are trying to solve,” Jeremy says. “The teams competing on Kaggle are typically analyzing data to try and solve problems on subjects like glaciers or particle physics or electrical engineering.”
What’s powerful is how these people can channel their unique expertise in one area into another. For example, one NASA-funded Kaggle competition involved the search for dark matter — that elusive material whose existence has been suspected for decades but has never been found.
“People have been looking for that stuff for some time,” Jeremy said. “Now, there’s really one thing we know about dark matter, which is that it has gravitational pull. And the one thing that we do know about gravitational pull is that it can actually bend light. So some pretty smart researchers realized a while ago that if we look at really distant objects in the sky, distant galaxies, we should be able to detect dark matter by seeing the light from those distant objects being bent; it skews the light from these galaxies. All you need to do to create a universal map of dark matter is to create a universal map of these galaxies, and figure out how much their light has been skewed. And that way you know where this dark matter is and you have a map of it. Now, coming up with an algorithm to do this had been attempted for three decades, but nothing significant had been developed.”
To solve the dark matter mapping challenge, NASA put all of its galactic observation data online in a Kaggle competition and asked scientists to come up with an improved algorithm to map the dark matter. “Within three days of launching the competition, our competing teams basically smashed all past research efforts,” said Jeremy.
What was also remarkable was the source of the early breakthrough. “It didn’t come from an astronomer or astrophysicist. Instead it came from a guy named Martin O’Leary, who studies the movements of glaciers at Cambridge University,” Jeremy said. “He’s a glaciologist who had developed algorithms to correct for atmospheric refraction and pixelation of glaciers’ images taken from Earth-orbiting satellites. And when he applied his learning to distorted galactic images, he saw a real improvement over previous algorithms.”
“By the end of three months, 15 teams had surpassed all previous NASA research, all using different approaches. In fact, some particle physicists ended up winning this competition. Their best result was over 300 percent more accurate than NASA’s previous best algorithms. All 15 groups went to the NASA Jet Propulsion Laboratory and worked together with NASA in actually implementing these dark-matter-mapping algorithms,” Jeremy said.
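For the curious, here is a rough sketch of what “figuring out how much a galaxy’s light has been skewed” can look like in code. This is my own illustration, not the winning algorithm or anything NASA or Kaggle published: it uses a textbook measure (ellipticity computed from brightness-weighted second moments), the function name `ellipticity` is hypothetical, and the tiny pixel grids are made up. Real approaches also have to correct for blur and noise, which this ignores.

```python
def ellipticity(image):
    """Return (e1, e2) for a 2D grid of pixel brightness values.

    (0, 0) means the blob looks round; values further from zero
    mean it looks more stretched."""
    total = sum(sum(row) for row in image)
    # Brightness-weighted center of the blob
    cx = sum(v * x for row in image for x, v in enumerate(row)) / total
    cy = sum(v * y for y, row in enumerate(image) for v in row) / total
    # Second moments: how far the light spreads along and across each axis
    qxx = qyy = qxy = 0.0
    for y, row in enumerate(image):
        for x, v in enumerate(row):
            dx, dy = x - cx, y - cy
            qxx += v * dx * dx
            qyy += v * dy * dy
            qxy += v * dx * dy
    size = qxx + qyy
    return (qxx - qyy) / size, 2.0 * qxy / size


# Made-up images: a round blob and one smeared out horizontally
round_blob = [
    [0, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
]
stretched_blob = [
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
]

print(ellipticity(round_blob))
print(ellipticity(stretched_blob))
```

The round blob scores zero on both numbers, while the smeared one does not. Measure that for enough galaxies across the sky and patterns in the stretching hint at where the unseen mass is sitting.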
When I asked Jeremy what problems he thought Kaggle would be most useful for solving, especially for an entrepreneur, he gave me this list of his top five ideas:
1. Helping an entrepreneur start a business: Kaggle can help you analyze the data of a particular industry to see where openings might exist for new products.
2. Sorting through visual data more quickly than humans. “We held a competition that allowed a company to develop an algorithm that would actually predict which user-generated pictures were more ‘beautiful’ than others.”
3. Tapping into a variety of customizable data products. “One such is the news-aggregation site Prismatic,” Jeremy said, “which curates news from all over the Web and then uses machine intelligence to predict which articles you’re likely to enjoy and curates a kind of newspaper for you each day that it thinks you’re going to like.”
4. Seeing where technology can drive innovation in certain industries, such as automotive. Kaggle worked with Ford, for example, “to identify a system for cars that would automatically identify if you were getting drowsy, or even not alert,” Jeremy said. “They had lots of data about vehicle sensors and physiological data and so forth and they wanted a predictive model. They put that data up on Kaggle and within days there were people who were solving that problem. None of them had a background in vehicle safety, or vehicle sensor systems.”
5. Identifying how and where machine learning can transform a new or existing business. This is where a new capability called “Kaggle Prospect” comes in. In Kaggle Prospect, “you can run a competition where people come up with ideas on what data competitions to run using your data. So you say, ‘Here’s our data, here’s a snapshot of roughly how our business works,’ and data scientists who actually understand machine learning come back to you with different ideas on what insights, solutions and improvements they can extract from your data.”
After spending the afternoon with Jeremy, here are my top three takeaways regarding Kaggle competitions:
1. People compete for the challenge, not the money. Most of the data scientists who compete do so in their spare time; they’re working at universities or running research departments.
2. People compete to learn about their process. “If you do really well then you learn that this is an algorithm which is amongst the best in the world; if you don’t do so well, you’ll find out where your gaps are,” Jeremy says.
3. Leaderboards are a great way to spur competition. “I used to compete myself,” Jeremy says, “and I found that there were many situations where I’d thought I’d done the best that I possibly could do, but getting passed on the leaderboard made me find things I didn’t know were in me.”
In my next blog I’m going to continue writing about Kaggle, but this time I’m going to show you how the platform helped a large insurance company realize billions of dollars in benefits, all for a $10,000 prize.
NOTE: As always, I would love your help in co-creating BOLD, and will happily acknowledge you as a “contributing author” for your input. Please share with me (and the community) in the comments below what you specifically found most interesting, what you disagree with and any similar stories or examples that reinforce this blog that I might use as examples in writing BOLD. Thank you!