PROPOSAL

Data representativity, similarity, and diversity

Supervisors: Veronika Cheplygina
Semester: Spring 2025
Tags: machine learning, medical imaging, data analysis, meta-research

Machine learning methods are often evaluated on benchmark datasets in computer vision, medical imaging, NLP, and other fields. In such evaluations, researchers often describe the data as being:

representative, for example because the age distribution of the patients mirrors that of the world population,
similar, for example because both datasets contain pictures of animals,
diverse, for example because the data was collected from multiple countries.

Recent research shows that the concepts of representativity, similarity, and diversity are not strictly defined, and that different researchers may use different notions. Such statements can therefore have large implications for the conclusions drawn from the research.

The goal of the project would be to create a dataset of how these concepts are used in recent ML papers, and to analyze differences, for example across years or conferences.
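
As a rough illustration of what such an analysis could look like, the sketch below counts how many papers mention each concept, grouped by year or venue. Everything in it is an assumption made for illustration: the keyword patterns, the record fields (year, venue, text), and the toy records are hypothetical, and simple keyword matching would likely be only a first step before manual annotation of how each term is actually used.

```python
import re
from collections import defaultdict

# Hypothetical keyword patterns (an assumption, not a validated coding scheme).
# Note that "similar\w*" also matches words such as "similarly".
CONCEPT_PATTERNS = {
    "representativity": re.compile(r"\brepresentativ\w*", re.IGNORECASE),
    "similarity": re.compile(r"\bsimilar\w*", re.IGNORECASE),
    "diversity": re.compile(r"\bdivers\w*", re.IGNORECASE),
}


def concepts_mentioned(text):
    """Return the set of concepts whose pattern occurs at least once in text."""
    return {name for name, pattern in CONCEPT_PATTERNS.items() if pattern.search(text)}


def count_by_group(papers, key):
    """Count, per group (e.g. year or venue), the papers mentioning each concept."""
    counts = defaultdict(lambda: {name: 0 for name in CONCEPT_PATTERNS})
    for paper in papers:
        group_counts = counts[paper[key]]  # creates the group even without mentions
        for concept in concepts_mentioned(paper["text"]):
            group_counts[concept] += 1
    return dict(counts)


# Toy records, invented purely for illustration.
papers = [
    {"year": 2022, "venue": "A", "text": "We use a diverse and representative dataset."},
    {"year": 2022, "venue": "B", "text": "The source and target data are similar."},
    {"year": 2023, "venue": "A", "text": "Data diversity was ensured by sampling multiple sites."},
]

by_year = count_by_group(papers, "year")
by_venue = count_by_group(papers, "venue")
```

Each paper is counted at most once per concept, so the result is the number of papers mentioning a concept rather than the number of occurrences.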

Groups of 2+ students are preferred (students from any study program and mixed groups are welcome).

References:

Cheplygina, V. (2019). Cats or CAT scans: Transfer learning from natural or medical image source data sets? Current Opinion in Biomedical Engineering, 9, 21-27.
Zhao, D., Andrews, J. T., Papakyriakopoulos, O., & Xiang, A. (2024). Position: Measure Dataset Diversity, Don’t Just Claim It. arXiv preprint arXiv:2407.08188.
Clemmensen, L. H., & Kjærsgaard, R. D. (2022). Data representativity for machine learning and AI systems. arXiv preprint arXiv:2203.04706.