In the world of data science, there are few honors higher than becoming a Kaggle Grandmaster.
For those new to the field, Kaggle is the largest online community for data scientists. The platform has well over 1 million users, hosts datasets and machine learning competitions, and provides educational resources for those looking to improve their skills. Kaggle’s competitions have grown from simple exercises into elite affairs with monetary prizes, some of them open only to top-rated users.
According to Nvidia data scientist and Kaggle Grandmaster Bojan Tunguz, to achieve the title of Grandmaster, one must get to know machine learning and predictive analytics better than almost anyone else, while constantly reevaluating hypotheses to win the biggest competitions.
Tunguz, who was born in Sarajevo, Bosnia, began using Kaggle after transitioning away from studying physics, using the platform as a way to learn data science. Now, he’s the first to break into the Top 10 in all four of Kaggle’s categories after entering countless competitions (and winning two).
Tunguz discussed via email what it takes to become a Kaggle Grandmaster, how physics prepared him for a career in data science, and how he envisions the future of machine learning.
The Business of Business: You just made Kaggle’s Top 10 in every category. What skills does it take to become a Kaggle Grandmaster? What was that journey like?
Bojan Tunguz: There are four different categories in which Kaggle awards progression and ranking points: competitions, datasets, notebooks (kernels) and discussion. They each require a distinct set of skills, although the one overarching theme is, of course, the mastery of various aspects of data science and machine learning.
The competition Grandmaster tier is, by far, the hardest one to achieve. To get to that level you have to acquire a really high level of mastery and deep understanding of machine learning and predictive analytics. You also need to develop a way to properly analyze and evaluate your models, be able to learn quickly, and brutally prioritize what you focus on. Being able to quickly discard your own favorite hypotheses and approaches when they don’t work out is one of the main keys to your success.
The other three categories are qualitatively different, and require a different set of skills. They are not as “objective” as the competition’s success, since they are based on popularity and not some objective criterion. However, they still require a deep understanding of data science as a field — which methods and libraries to use, what information the community may find helpful (and when!), what are some interesting datasets and how to find them, etc. These other categories also highlight and promote some “soft” skills — being able to communicate effectively and persuasively, being pedagogical, having the ability to interact with people from vastly different backgrounds, and being able to get along with others and work effectively with them, etc. These latter skills are often underappreciated, especially in cut-throat technical job environments, but in my experience they are extremely valuable, and will continue to be essential to your long-term professional advantage if you can develop and cultivate them.
What drew you into Kaggle in the first place?
Many years ago I was transitioning my career from physics into data science. For a while this seemed like an almost hopeless task — data science was not nearly as hot of a field as it is today, and there were very few inroads for an outsider to take. I had tried bolstering my skills and credentials by pursuing various online courses, bootcamps and even small consulting projects. None of those really helped me get a foot in the door with any real data science jobs.
I had heard about Kaggle ever since I got interested in data science, but for the longest time felt intimidated to give it a try. Eventually I decided to throw myself in, and from the moment I entered my first competition I was hooked! I realized that the hypercompetitive environment, far from being threatening, is probably the best way to push myself to learn new skills and grow my technical expertise. I believe that within six months of becoming active on Kaggle I had grown professionally an order of magnitude more than in all of my previous work.
How did studying physics lead to your current career as a data scientist, and what did physics teach you to prepare for your current career?
This is a very good question, and one that I am still trying to find an adequate answer for. The short answer is — it didn’t. There is hardly anything explicit in my physics background and curriculum that can be directly applied to data science. I was/am a theoretical physicist, and most of my education had very little to do with the actual data — its acquisition, analysis, interpretation, etc. I like to joke that the only numbers that I had to deal with were indices of indices. :)
“Being able to quickly discard your own favorite hypotheses and approaches when they don’t work out is one of the main keys to your success.”
However, on a “meta learning” level I think there is a high-level overlap between my approach to physics and how I approach data science. The following might summarize the points of contact, as I see them:
- I treat data science as Science. This might seem like a tautology, but within the practicing community of data scientists this is a very, very contentious issue. What treating data science as Science boils down to (at least for me) is: a) focus your work on designing and running experiments, b) spend a lot of time analyzing and interpreting those experiments, c) favor an exploratory mindset over an exploitative one, d) treat understanding as the ultimate “deliverable”.
- Probabilistic mindset. This is the basic operating mindset that permeates almost all of “traditional” science, as opposed to disciplines such as mathematics and computer science. In practical terms, probabilistic thinking means treating the options in front of you as a range of probabilistically weighted outcomes, not as discrete all-or-nothing alternatives.
- Numerical intuition. This is cognitively different from logical thinking, and even from general mathematical skill. Numerical intuition can help you assess, without explicit reasoning, whether your model training is going well, which models to combine, which combination of weights to use for your ensembling, which hyperparameters to try out, and whether your predictions can pass a “smell test” (see the sketch after this list).
- Being able to think in high dimensional vector spaces. Even though I rarely, if ever, explicitly use this way of thinking when working on Machine Learning problems, it is something that is always latently present in all of my work.
- High-level modeling. When looking from the abstract modeling perspective, what I now do in data science and what I used to do in theoretical physics has not changed at all: I come up with sophisticated models that describe some aspects of reality. I used to do modeling of the fundamental physical reality with higher-level math, and now I model various interesting real-world phenomena with code. But the high-level principles are the same.
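To make the ensembling point above concrete, here is a minimal, hypothetical sketch of the kind of weighted blending Tunguz alludes to; it is not his actual workflow, and the models, data, and weights are purely illustrative. The "numerical intuition" part is knowing which weight range is even worth trying and whether the blended scores look sane.

```python
# Illustrative sketch: blend validation predictions from two models with a
# handful of weights and compare the blend against each model alone.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a tabular competition dataset.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0
)

gbm = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

pred_gbm = gbm.predict_proba(X_valid)[:, 1]
pred_rf = rf.predict_proba(X_valid)[:, 1]

# w=1.0 is the GBM alone, w=0.0 is the random forest alone; values in between
# are weighted blends. A quick scan like this doubles as a "smell test".
for w in (0.0, 0.3, 0.5, 0.7, 1.0):
    blend = w * pred_gbm + (1 - w) * pred_rf
    print(f"w={w:.1f}  AUC={roc_auc_score(y_valid, blend):.4f}")
```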
You’ve won some of the biggest competitions on Kaggle. How do you prepare for a competition, and what’s it like to win?
So far I have won two Kaggle competitions, one with an amazing team, and one solo. The one I won with a team was the biggest Kaggle competition up to that point, with over 7,000 teams, but since then it has been eclipsed by an even larger competition. The nature of Kaggle competitions has changed dramatically over the past couple of years, and it is very likely that those two will remain the largest competitions for the foreseeable future.
The best way to prepare for a Kaggle competition is to do a lot of Kaggle competitions. :) As with any other competitive endeavor, the experience of competing itself is the best preparation for future contests. The next best thing to competing a lot is to go over past competitions and see whether any of them are similar to the one you decide to enter. Read past solutions and go over the past notebooks and code. When you enter the competition, make sure that you keep abreast of the forum discussions and the notebooks — they often reveal a lot of useful information and hints.
What developments in machine learning fascinate you right now?
I have been downright floored by the explosion of research and development in NLP over the past two years. A few years ago I was a member of a team on Kaggle that finished third in a very interesting Toxic Comments classification competition. At that time I thought I was getting a pretty good handle on NLP. However, with the advent of transformers, and the explosion in datasets and computational resources that the best NLP models have been utilizing, the field has progressed at a breakneck pace. The scary part is that I believe we are still only scratching the surface of what is possible, and over the next few years (if not much sooner!) we may witness some epochal developments in this area.
"If you approach Kaggle as a learning experience, and not a proving ground for your already high-level expertise, then you should prepare for a long journey with a lot of setbacks and disappointments."
My own ML interests have been shifting towards exploring applications of ML to tabular data. I feel that the field has been somewhat stagnant compared to what has been happening with Deep Learning, and I believe there are a lot of great opportunities to make an important contribution there.
Where do you see the industry in five years’ time?
I believe that in five years most companies will either become tech companies from the ground up or be relegated to niche sectors. Technology will enable full digital transformation to finally take shape, and with that transformation a data-centered mindset will dominate. With the proliferation of data, applications of ML will become ubiquitous, fully integrated into almost all technology, applications, and services.
Within ML, there will be increasing subspecialization. The days when you could work on NLP in the morning, tabular data in the afternoon, and image data in the evening may be coming to an end. There will also be increasing differentiation within the data science pipeline — expertise in data wrangling, EDA, feature engineering, and “proper” modeling might become distinct professional specializations. AutoML tools will become sophisticated enough that they will completely change the workflow of many data science practitioners. The best data science work will always require good coding skills, but I believe we’ll finally get to the point where we don’t assume that data scientists are a subset of software engineers.
What advice would you give to anyone who wants to be a successful Kaggler?
If you are like me and you approach Kaggle as a learning experience, and not a proving ground for your already high-level expertise, then you should prepare yourself for a long journey with a lot of setbacks and disappointments. Try to learn from the setbacks as much as you try to learn from your successes, if not even more so. Set manageable goals for yourself — a steady improvement over a long period of time will get you far. Measure your progress against your former self, rather than others. Follow the discussion and notebooks — they contain a lot of important insights and valuable information. Deep Learning, and computer vision in particular, has come to dominate Kaggle competitions, so you should be fully prepared to invest all the time that is necessary to acquire the high level of technical expertise that those sorts of ML problems require.
Finally, if you are really serious about Kaggling, you will need to invest in your own high-end hardware. Kaggle notebooks and other free online compute resources (such as Colab) can help you get off the ground, but aside from a handful of exceptions, most Kaggle competitions these days require access to a lot of computational resources.