Submitted by hcarlens t3_11kzkla in MachineLearning
I run mlcontests.com, a website that aggregates ML competitions across Kaggle and other platforms.
I've just finished a detailed analysis of 200+ competitions in 2022, and what winners did (we found winning solutions for 67 competitions).
Some highlights:
- Kaggle still dominant with the most prize money, most competitions, and most entries per competition...
- ... but there are 10+ other platforms with interesting competitions and decent prize money, and dozens of single-competition sites
- Almost all competition winners used Python, 1 used C++, 1 used R, 1 used Java
- 96% (!) of Deep Learning solutions used PyTorch (up from 77% last year)
- All winning NLP solutions we found used Transformers
- Most computer vision solutions used CNNs, though some used Transformer-based models
- Tabular data competitions were mostly won by GBDTs (gradient-boosted decision trees; mostly LightGBM), though ensembles with PyTorch are common
- Some winners spent hundreds of dollars on cloud compute for a single training run, others managed to win just using Colab's free tier
- Winners have largely converged on a common toolkit - PyData stack for the basics, PyTorch for deep learning, LightGBM/XGBoost/CatBoost for GBDTs, Optuna for hyperparam optimisation.
- Half of competition winners are first-time winners; a third have won multiple comps before; half are solo winners. Some serial winners won 2-3 competitions just in 2022!
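For anyone who hasn't seen the "GBDT ensembled with a neural net" pattern from the tabular bullet above, here's a minimal sketch of one simple way to do it: a weighted average of the two models' predictions, with the weight picked on a validation set. This is an illustration only, not any particular winner's method. The function names are hypothetical and the numbers are made up; in practice the two prediction lists would come from something like a LightGBM model and a PyTorch model.

```python
# Toy sketch: blend two models' predictions with a weighted average.
# All numbers below are invented for illustration.

def rmse(y_true, y_pred):
    """Root-mean-squared error between two equal-length sequences."""
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)) ** 0.5

def blend(preds_a, preds_b, weight_a):
    """Weighted average of two prediction lists; weight_a is model A's share."""
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(preds_a, preds_b)]

def best_blend_weight(y_true, preds_a, preds_b, steps=20):
    """Grid-search the weight in [0, 1] that minimises validation RMSE."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda w: rmse(y_true, blend(preds_a, preds_b, w)))

# Hypothetical validation targets and per-model validation predictions.
y_val = [3.0, 5.0, 2.0, 7.0]
gbdt_val = [3.4, 4.6, 2.5, 6.5]  # stand-in for GBDT output
nn_val = [2.5, 5.5, 1.6, 7.4]    # stand-in for neural-net output

w = best_blend_weight(y_val, gbdt_val, nn_val)
blended = blend(gbdt_val, nn_val, w)

print(f"weight on GBDT: {w:.2f}")
print(f"RMSE GBDT:  {rmse(y_val, gbdt_val):.3f}")
print(f"RMSE NN:    {rmse(y_val, nn_val):.3f}")
print(f"RMSE blend: {rmse(y_val, blended):.3f}")
```

Because the grid includes weights 0 and 1, the blend can never score worse on the validation set than either model alone. Whether that gain carries over to the private leaderboard is a separate question, so the weight should be chosen on held-out data rather than on the training set.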
Way more details as well as methodology here in the full report: https://mlcontests.com/state-of-competitive-machine-learning-2022?ref=mlc_reddit
[Chart: Most common Python packages used by winners]
When I published something similar here last year, I got a lot of questions about tabular data, so I did a deep dive into that this year. People also asked about leaderboard shakeups and compute cost trends, so those are included too. I'd love to hear your suggestions for next year.
I managed to spend way more time on this analysis than last year thanks to the report sponsors (G-Research, a top quant firm, and Genesis Cloud, a renewable-energy cloud compute firm) - if you want to support this research, please check them out. I won't spam you with links here; there's more detail on them at the bottom of the report.
backhanderer t1_jb9ph2v wrote
Thanks for this. I knew PyTorch was dominant but didn’t realise it was this dominant for deep learning!