Machine Learning -- DASYA

PROPOSAL

Data Attribution on Progressive Datasets for Deep Learning

Deep convolutional networks can learn representations of images, scoring well in tasks such as image classification and object detection. During model training, these networks can process different input sizes without requiring changes to their architecture. In this project, we would like to investigate the effects that changing input sizes has on these kinds of models. We …
Supervisors: Pınar Tözün, Ties Robroek
Semester: Fall 2025
Tags: data attribution, deep learning, machine learning, resource efficiency

PROPOSAL

Efficient Data Selection Methods for Machine Learning

Today’s foundation models are trained on vast amounts of data. The quality and size of this data have a huge impact on the accuracy of these models. Selecting the right amount and variety of data for a given task, however, is a resource-intensive process. In this project, which is part of a larger collaboration, we would like to expand our investigation of state-of-the-art data selection mechanisms …
Supervisors: Pınar Tözün, Ties Robroek
Semester: Fall 2025
Tags: data selection, deep learning, machine learning, resource efficiency

PROPOSAL

Good or Bad: LLM-Generated Datasets

(MSc Research Project / MSc Thesis) The goal of this project is to research how AI and large language models generate datasets. Research questions include: Where does the generated data come from? Are sources available on the internet or can they be found? What biases exist in the generated data? And how much of the data is simply wrong? Generated datasets are used in many fields in practice, …
Supervisors: Martin Hentschel
Semester: Fall 2025
Tags: training data, machine learning, LLMs

PROPOSAL

Learning-to-rank methods for query optimization

Query optimization is crucial for any data management system to achieve good performance. Recent advancements in AI have led academia and industry to investigate learning-based techniques in query optimization. In particular, many works propose replacing the cost model used during plan enumeration with a machine learning model (typically a regression model) that estimates the runtime of a query …
Supervisors: Zoi Kaoudi
Semester: Fall 2025
Tags: machine learning, database, query optimization, ranking

PROPOSAL

Training data generation for data management systems

Query optimization is crucial for any data management system to achieve good performance. Recent advancements in AI have led academia and industry to investigate learning-based techniques in query optimization. In particular, many works propose replacing the cost model used during plan enumeration with a machine learning model that estimates the runtime of a plan. However, to build such a model …
Supervisors: Zoi Kaoudi
Semester: Fall 2025
Tags: machine learning, training data, query optimizer

PROPOSAL

Data representativity, similarity, and diversity

Machine learning methods are often evaluated on benchmark datasets in computer vision, medical imaging, NLP, and other fields. In such evaluations, researchers often describe the data as being: representative, for example based on the distribution of ages of the patients mirroring the world population; similar, for example because both datasets contain pictures of animals; diverse, for example …
Supervisors: Veronika Cheplygina
Semester: Spring 2025
Tags: machine learning, medical imaging, data analysis, meta-research

PROPOSAL

Machine learning on optical fiber sensor data

Optical fiber is the backbone of the internet’s communication, e.g. in the form of submarine fiber cables. It can also be employed as a sensor device, by means of combined opto-acoustic methods such as Distributed Acoustic Sensing (DAS) or State of Polarisation (SoP) sensing. Fiber is capable of sensing all kinds of vibrational/acoustic events, from animal sounds through seismic activity to …
Supervisors: Sebastian Büttrich
Semester: Fall 2025
Tags: fiber, acoustics, audio, machine learning, DAS, SOP

PROPOSAL

Learning-based image quality enhancement on CubeSat

The DISCO-2 project is driven by students and aims to develop and deploy a 3-unit CubeSat into low Earth orbit. Its mission focuses on conducting Earth observations over Greenland and supporting various research objectives. The satellite has three cameras onboard: infrared, wide-angle, and standard (main camera). Due to the limitations of the imaging hardware and the challenging conditions on the …
Supervisors: Yucheng Lu, Julian Priest
Semester: Fall 2024
Tags: Image enhancement, Image processing, Machine learning

PROPOSAL

Debiasing medical image datasets to improve machine learning robustness and fairness

It has been observed that deep learning models are able to identify patient characteristics such as age, sex, and self-reported race with high accuracy from medical images such as chest x-ray recordings, even when medical doctors cannot. This raises the potential for such models to learn to (falsely) diagnose patients of different demographics differently, even if they present with the same …
Supervisors: Amelia Jiménez-Sánchez, Eike Petersen, Veronika Cheplygina
Semester: Fall 2024
Tags: machine learning, data science, medical imaging

PROPOSAL

Concept Bottleneck Models to detect hidden features or avoid memorization

Concept Bottleneck Models [1] are designed to leverage high-level concepts. They revisit the classic idea of first predicting concepts that are provided at training time, and then using these concepts to predict the label. By construction, it is possible to intervene on these concept bottleneck models by editing their predicted concept values and propagating these changes to the final prediction. …
Supervisors: Amelia Jiménez-Sánchez
Semester: Spring 2025
Tags: machine learning, data science, medical imaging
