PROPOSAL

Estimating the carbon footprint of Kaggle classification challenges

Supervisors: Veronika Cheplygina
Semester: Fall 2022
Tags: machine learning, medical imaging, data analysis, resource consumption

Machine learning models, especially the larger models used for image or text data, can be expensive to train. During development, models are usually trained multiple times, for example to optimize hyperparameters, which can result in a large carbon footprint.

This project focuses specifically on medical data. There are some recent efforts, for example [https://arxiv.org/abs/2203.02202], to quantify the carbon footprint of papers published in this field.

The goal of this project is to do a similar quantification for Kaggle competitions, such as the famous lung cancer challenge with a 1 million USD prize [https://www.kaggle.com/competitions/data-science-bowl-2017], in which almost 2000 teams competed.
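
As a rough illustration of what such a quantification could look like, the sketch below uses a common back-of-the-envelope approach: energy is estimated as hardware power times training time times a data-centre overhead factor (PUE), and emissions as energy times the carbon intensity of the electricity grid. This is not the method of the cited paper. The function names and every numeric value (runs per team, hours per run, GPU power, PUE, grid intensity) are hypothetical placeholders that the project would have to estimate from competition write-ups, notebooks, and hardware information; only the team count of about 2000 comes from the text above.

```python
def training_emissions_kg(gpu_hours, gpu_power_watts, pue, grid_kg_co2e_per_kwh):
    """Estimate emissions (kg CO2e) for a given amount of GPU time.

    energy [kWh]      = GPU hours * GPU power [kW] * PUE
    emissions [kg]    = energy [kWh] * grid carbon intensity [kg CO2e / kWh]
    """
    energy_kwh = gpu_hours * (gpu_power_watts / 1000.0) * pue
    return energy_kwh * grid_kg_co2e_per_kwh


def competition_emissions_kg(n_teams, runs_per_team, hours_per_run,
                             gpu_power_watts, pue, grid_kg_co2e_per_kwh):
    """Scale the single-run estimate to all teams, assuming one GPU per run."""
    total_gpu_hours = n_teams * runs_per_team * hours_per_run
    return training_emissions_kg(total_gpu_hours, gpu_power_watts,
                                 pue, grid_kg_co2e_per_kwh)


if __name__ == "__main__":
    # All values except n_teams are hypothetical placeholders, not measurements.
    estimate = competition_emissions_kg(
        n_teams=2000,              # "almost 2000 teams" per the challenge page
        runs_per_team=10,          # assumed number of training runs per team
        hours_per_run=5,           # assumed GPU hours per training run
        gpu_power_watts=250,       # assumed average GPU power draw
        pue=1.5,                   # assumed data-centre overhead factor
        grid_kg_co2e_per_kwh=0.4,  # assumed grid carbon intensity
    )
    print(f"Estimated competition footprint: {estimate:.0f} kg CO2e")
```

Because every input is uncertain, an estimate like this is best reported as a range over plausible parameter values rather than as a single number.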

The carbon footprint can then be compared to the “practical significance” of the winner's performance improvement; see, for example, [https://www.nature.com/articles/s41746-022-00592-y].

The project is suitable as a BDS thesis project or as the KDS research project, with the possibility of continuing in this area for a thesis project.