PROPOSAL
Estimating the impact of Kaggle classification challenges
Machine learning models, especially large models for image or text data, can be expensive to train. During development, models are usually trained multiple times, for example to optimize hyperparameters, which can result in a large carbon footprint.
This project focuses specifically on medical data. There are some recent efforts, for example by Selvan et al., to quantify the carbon footprint of papers published in the community.
The goal of this project is to do a similar quantification for Kaggle competitions, such as the well-known lung cancer challenge with a 1 million USD prize, in which almost 2000 teams competed. From a BDS thesis project completed in Spring 2023, we already have some initial results.
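As a rough illustration of what such a quantification could involve, the sketch below multiplies the energy used for training by the carbon intensity of the electricity grid. It is only a back-of-the-envelope estimate, not the method used by Selvan et al. or in the thesis project. The function name, its parameters, and all numbers in the usage example are hypothetical placeholders, not measurements from any competition.

```python
def estimate_co2_kg(power_kw, hours_per_run, n_runs, pue, intensity_kg_per_kwh):
    """Rough estimate of training emissions in kg CO2e.

    All arguments are assumptions the user must supply:
    power_kw             -- average power draw of the hardware, in kW
    hours_per_run        -- wall-clock duration of one training run
    n_runs               -- number of runs (e.g. hyperparameter trials)
    pue                  -- power usage effectiveness of the data center
    intensity_kg_per_kwh -- carbon intensity of the electricity used
    """
    # Energy drawn by the hardware, scaled by data-center overhead.
    energy_kwh = power_kw * hours_per_run * n_runs * pue
    # Convert energy to emissions.
    return energy_kwh * intensity_kg_per_kwh


# Hypothetical values for illustration only.
footprint = estimate_co2_kg(
    power_kw=0.25,
    hours_per_run=10,
    n_runs=20,
    pue=1.5,
    intensity_kg_per_kwh=0.4,
)
print(f"Estimated footprint: {footprint:.1f} kg CO2e")
```

For a whole competition, such an estimate would have to be repeated per team, and the number of runs per team is usually unknown, which is one reason the quantification is not straightforward.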
The carbon footprint can then be compared to the significance of the winner's performance improvement. For example, in a recent paper we show that the top algorithms for a challenge often have performance differences that are smaller than the evaluation noise (in other words, if the data were split differently, the ranking of the algorithms would change). More generally, it could be relevant to measure other outcomes, such as how the proposed algorithms affect healthcare in terms of hours saved by clinicians, improved quality of life for patients, and so on.
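One way to get a feeling for evaluation noise is to resample the test set with replacement and count how often each algorithm comes out on top. The sketch below does this for accuracy. It is a simplified stand-in for re-splitting the data, not the procedure from the paper mentioned above. The function name and the synthetic labels and predictions are hypothetical.

```python
import numpy as np


def top_rank_stability(y_true, predictions, n_boot=1000, seed=0):
    """Share of bootstrap resamples in which each model has the best accuracy.

    y_true      -- array of shape (n_samples,)
    predictions -- array of shape (n_models, n_samples)
    Ties are assigned to the model with the lowest index.
    """
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    predictions = np.asarray(predictions)
    # Boolean matrix: correct[m, i] is True if model m is right on sample i.
    correct = predictions == y_true
    n_models, n_samples = correct.shape
    wins = np.zeros(n_models)
    for _ in range(n_boot):
        # Draw a resampled test set of the same size, with replacement.
        idx = rng.integers(0, n_samples, size=n_samples)
        accuracy = correct[:, idx].mean(axis=1)
        wins[np.argmax(accuracy)] += 1
    return wins / n_boot


# Synthetic example: two models with roughly 10% and 11% error rates.
rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=200)
model_a = np.where(rng.random(200) < 0.10, 1 - y_true, y_true)
model_b = np.where(rng.random(200) < 0.11, 1 - y_true, y_true)
shares = top_rank_stability(y_true, np.stack([model_a, model_b]))
print("Share of resamples won by each model:", shares)
```

If the winning share is far from 1 for the top model, the ranking depends heavily on which test samples happen to be included, and a small lead on the leaderboard says little about which algorithm is actually better.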
Multiple projects are possible; groups of 2 are preferred.