Here you can see a list of all currently proposed projects. For a list of all previous proposals, see the proposal archive.
Machine learning methods are often evaluated on benchmark datasets, in computer vision, medical imaging, NLP and other fields. In such evaluation, researchers often describe the data as being:
representative, for example because the age distribution of the patients mirrors that of the world population
similar, for example because both datasets contain pictures of animals
diverse, for example …
Supervisors:
Veronika Cheplygina
Semester: Spring 2025
Tags: machine learning, medical imaging, data analysis, meta-research
In medical imaging, multi-task learning can be used to train a model that jointly predicts both a diagnosis and other patient characteristics, such as demographic variables. Among other applications, this strategy has frequently been used for the diagnosis of Alzheimer’s from brain MR scans, with age as an additional variable; see Zhang et al. for an example. The idea is that both the disease, and age, …
Supervisors:
Veronika Cheplygina
Semester: Spring 2025
Tags: machine learning, medical imaging, data analysis, fairness
Machine learning methods for medical imaging, for example segmentation of skin lesions or classification of lung cancer, are often evaluated on benchmark datasets such as ISIC, CheXpert, and MIMIC-CXR. In such evaluations, researchers often compare the methods they propose to state-of-the-art methods in the field, and report performance metrics such as the Dice score or AUC.
Due to …
Supervisors:
Veronika Cheplygina
Semester: Spring 2025
Tags: machine learning, medical imaging, data analysis, meta-research, reproducibility
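As a point of reference for the metrics mentioned in this proposal, the Dice score for a binary segmentation can be sketched as below. This is a minimal illustration in plain Python, not code from the project; the function name `dice_score` and the convention of returning 1.0 when both masks are empty are assumptions made for the example.

```python
def dice_score(pred, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two binary masks.

    `pred` and `target` are 2D lists of 0/1 values with the same shape
    (a hypothetical input format chosen for this sketch).
    """
    pred_flat = [bool(v) for row in pred for v in row]
    target_flat = [bool(v) for row in target for v in row]
    if len(pred_flat) != len(target_flat):
        raise ValueError("masks must have the same number of pixels")

    intersection = sum(p and t for p, t in zip(pred_flat, target_flat))
    total = sum(pred_flat) + sum(target_flat)
    if total == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return 2.0 * intersection / total


predicted_mask = [[1, 1], [0, 0]]
reference_mask = [[1, 0], [0, 0]]
score = dice_score(predicted_mask, reference_mask)  # 2 * 1 / (2 + 1)
```

Here one of the two predicted foreground pixels overlaps the single reference pixel, so the score is two thirds.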
Spectral learning priority is a useful tool for analyzing a model’s focus during training: it describes how a model may understand a given image from the spectrum perspective. For example, to distinguish cats from tortoises, learning to recognize their shapes would be enough; such an embedding will result in a higher learning priority at low frequencies, which represent shapes, while learning to …
Supervisors:
Yucheng Lu, Veronika Cheplygina
Semester: Fall 2024
Tags: Spectral analysis, Image classification, Medical imaging
It has been observed that deep learning models are able to identify patient characteristics such as age, sex, and self-reported race with high accuracy from medical images such as chest x-ray recordings, even when medical doctors cannot. This raises the potential for such models to learn to (falsely) diagnose patients of different demographics differently, even if they present with the same …
Supervisors:
Amelia Jiménez-Sánchez, Eike Petersen, Veronika Cheplygina
Semester: Fall 2024
Tags: machine learning, data science, medical imaging
There have been several situations where machine learning classifiers, trained to diagnose a particular disease (for example, lung cancer from chest x-rays), overfit on hidden features within the data. Examples include gridlines, surgical markers, evidence of treatment, or text present in the images (see references for examples). This causes the classifier to fail on other types of images. …
Supervisors:
Veronika Cheplygina, Amelia Jiménez-Sánchez
Semester: Spring 2025
Tags: machine learning, data science, medical imaging
Open-source JavaScript applications, such as browser-based web games, are typically developed by individual software engineers or small teams. These teams often have limited financial resources to use commercial logging frameworks and cloud-based analysis systems and may also lack knowledge and expertise in logging. However, log analysis is highly important for many reasons: monitoring application …
Supervisors:
Martin Hentschel
Semester: Fall 2024
Tags: open source, performance
The Deconstructed Cloud Databases project stems from a simple question: What are the minimum components required to build a data management system in the cloud? Our motivation for this project is based on the idea that reducing a system to its minimum set of components makes it easier to build, test, and maintain cloud data management systems. This approach requires less engineering effort, …
Supervisors:
Martin Hentschel
Semester: Fall 2024
Tags: data management, performance, benchmarking, hacking
The Deconstructed Cloud Databases project stems from a simple question: What are the minimum components required to build a data management system in the cloud? Our motivation for this project is based on the idea that reducing a system to its minimum set of components makes it easier to build, test, and maintain cloud data management systems. This approach requires less engineering effort, …
Supervisors:
Martin Hentschel
Semester: Fall 2024
Tags: data management, security, open source, open standards
Are you interested in working with a big data open source project?
You are welcome to conduct your thesis/project in Apache Wayang. Apache Wayang is the first cross-platform framework that allows users to specify their task/query in a system-agnostic manner; Wayang then determines the best system(s) on which to execute this task, with the goal of optimizing performance. For a general overview …
Supervisors:
Zoi Kaoudi
Semester: Fall 2024
Tags: big data, database, cross-platform data processing, open source, Apache
Knowledge graphs (KGs) are extensively used in many application domains, such as search engines, product recommendation, and bioinformatics. Knowledge graph completion (a.k.a. link prediction), i.e., the task of inferring missing information from knowledge graphs, is a widely used task in the above applications. This project will investigate how to loosely couple the data-driven power of knowledge …
Supervisors:
Zoi Kaoudi
Semester: Fall 2024
Tags: knowledge graph, LLMs, reasoning
Are you interested in working with a big data open source project and helping the environment?
You are welcome to conduct your thesis/project in Apache Wayang. Apache Wayang is the first cross-platform framework that allows users to specify their task/query in a system-agnostic manner; Wayang then determines the best system(s) on which to execute this task, with the goal of optimizing performance. …
Supervisors:
Zoi Kaoudi
Semester: Fall 2024
Tags: big data, database, cross-platform data processing, open source, Apache
Query optimization is crucial for any data management system to achieve good performance. Recent advancements in AI have led academia and industry to investigate learning-based techniques in query optimization. In particular, many works propose replacing the cost model used during plan enumeration with a machine learning model that estimates the runtime of a plan. However, to build such a model …
Supervisors:
Zoi Kaoudi
Semester: Fall 2024
Tags: machine learning, training data, query optimizer
Query optimization is crucial for any data management system to achieve good performance. Recent advancements in AI have led academia and industry to investigate learning-based techniques in query optimization. In particular, many works propose replacing the cost model used during plan enumeration with a machine learning model (typically a regression model) that estimates the runtime of a query …
Supervisors:
Zoi Kaoudi
Semester: Fall 2024
Tags: machine learning, database, query optimization, ranking
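The idea of replacing the cost model with a learned regression model, as described in this proposal, can be illustrated with a toy sketch. Everything here is assumed for the example: the single feature `est_rows`, the plan names, the training data, and the use of a one-variable least-squares fit. Real learned optimizers use far richer plan encodings and models.

```python
def fit_linear(xs, ys):
    """Least-squares fit of runtime = slope * x + intercept for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_runtime(plan, model):
    """Estimated runtime of a plan under the learned (slope, intercept) model."""
    slope, intercept = model
    return slope * plan["est_rows"] + intercept


def pick_plan(candidate_plans, model):
    """During plan enumeration, keep the plan with the lowest predicted runtime."""
    return min(candidate_plans, key=lambda plan: predict_runtime(plan, model))


# Hypothetical training data: (estimated rows processed, measured runtime in s).
training_rows = [100, 200, 300]
training_runtimes = [1.0, 2.0, 3.0]
model = fit_linear(training_rows, training_runtimes)

# Hypothetical candidate plans produced by the enumerator.
candidates = [
    {"name": "hash_join", "est_rows": 150},
    {"name": "nested_loop", "est_rows": 900},
]
best = pick_plan(candidates, model)
```

The sketch also shows why training data matters: the model can only rank plans well if the measured runtimes it was fitted on cover plans similar to the candidates.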
Running data-intensive applications on very powerful, expensive, and power-hungry server hardware is a popular area of work thanks to the growing size of data centers and high-performance computing (HPC) platforms. However, with the rise of new-generation Internet of Things (IoT) applications, the lower-power and lower-budget hardware devices that specifically target IoT, the edge platforms, …
Supervisors:
Pınar Tözün
Semester: Fall 2024
Tags: edge, benchmarking, data-intensive applications, resource-constrained hardware
Observing how well machine learning systems utilize hardware resources is a crucial preliminary step to improve system performance and reduce hardware waste. To make such observations, one has to collect a lot of monitoring data on hardware behavior through experiments. In our group, we have recently built a framework to aid the efficient management of such monitoring data, called Resource-Aware …
Supervisors:
Pınar Tözün, Ties Robroek
Semester: Fall 2024
Tags: benchmarking, data management, data visualization
Deep learning has changed the landscape of many applications, such as computer vision and natural language processing. On the other hand, deep learning requires the gigantic computing power offered by modern hardware. As a result, data scientists rely on powerful hardware resources offered by shared high-performance computing (HPC) clusters or the cloud. Due to the long running times of deep learning …
Supervisors:
Pınar Tözün, Ehsan Yousefzadeh-Asl-Miandoab
Semester: Fall 2024
Tags: machine learning systems, checkpointing, scheduling, resource management
Workload collocation has been shown to be an effective method to reduce the hardware requirements for certain deep learning (DL) training tasks. On the other hand, there haven’t been many robust open-source implementations of schedulers that incorporate workload collocation on GPUs for DL.
BLOX is a framework that aims at standardizing the way we implement deep learning schedulers. In this …
Supervisors:
Pınar Tözün, Ehsan Yousefzadeh-Asl-Miandoab
Semester: Fall 2024
Tags: machine learning systems, scheduling, resource management, workload collocation
Deep convolutional networks are able to learn representations of images, scoring well in tasks such as image classification and object detection. During model training, these networks have the ability to process different input sizes without requiring changes to their architecture. In this project, we would like to investigate the effects that changing input sizes has on these kinds of models. We …
Supervisors:
Pınar Tözün, Ties Robroek
Semester: Fall 2024
Tags: data attribution, deep learning, machine learning, resource efficiency
Today’s foundation models are trained on vast amounts of data. The quality and size of this data have a huge impact on the accuracy of these models. Selecting the right amount and variety of data for a given task, however, is a resource-intensive process. In this project, we would like to investigate various state-of-the-art data selection mechanisms from a hardware requirements and …
Supervisors:
Pınar Tözün, Ties Robroek
Semester: Fall 2024
Tags: data selection, deep learning, machine learning, resource efficiency
Traditionally, solid-state drives (SSDs) do not give users the ability to control data placement on the SSD. This often leads to suboptimal performance and lowers SSD lifetime, since SSDs internally don’t allow in-place updates. The updated disk pages are written elsewhere and the old versions have to be garbage collected. This poses problems if data with different lifetimes and …
Supervisors:
Pınar Tözün
Semester: Fall 2024
Tags: SSDs, data management systems, modern storage
In this project, we would specifically like to quantify the data movement savings of applying techniques like compression and model-based data filtering in the context of resource-constrained hardware and edge/IoT applications.
Today, many data sources are small, low-powered, and hardware-constrained devices such as mobile phones, wearable or self-driving smart platforms, etc. Processing the data on …
Supervisors:
Pınar Tözün, Robert Bayer
Semester: Fall 2024
Tags: resource-constrained hardware, data management, ML model updates, tinyML
One of the key challenges with enabling efficient machine learning on resource-constrained devices is keeping the machine learning models deployed on these devices up-to-date without frequent retraining. This requires exploring the impact of different model update mechanisms at the edge.
This project would be suitable as a standalone project or BSc or MSc thesis at ITU during Fall 2024. If you are …
Supervisors:
Pınar Tözün, Robert Bayer
Semester: Fall 2024
Tags: resource-constrained hardware, data management, ML model updates, tinyML
Enabling efficient data processing and machine learning on resource-constrained devices poses many challenges. One is fitting the models into the restrictive memory and compute resources of these devices. In this project, first, we would like to explore the landscape of foundational, generative-AI, language, etc. models with respect to their size and compute needs to understand what could be a fit …
Supervisors:
Pınar Tözün, Robert Bayer
Semester: Fall 2024
Tags: resource-constrained hardware, data management, ML model updates, tinyML
Today, many data sources are small, low-powered, and hardware-constrained devices such as mobile phones, wearable or self-driving smart platforms, etc. Edge computing is a broad term that refers to computations performed on such edge devices. It is becoming increasingly important to enable techniques that get more value out of data at the edge rather than always sending the data to a remote and more …
Supervisors:
Pınar Tözün, Robert Bayer
Semester: Fall 2024
Tags: resource-constrained hardware, data management, resource management, tinyML
This is not a single project, but rather a larger cluster of potential projects in the field of what could be summarized as extreme networking.
The networks we are interested in are typically wireless, and can be extreme in different senses of the word:
distance - hundreds of kilometers terrestrial, 10,000s of km to satellite
latency - sub-ms latencies
autonomy - off-grid
quality - extreme remote …
Supervisors:
Sebastian Büttrich
Semester: Fall 2024
Tags: network, IoT, LoRa, LoRaWAN, satellites
LoRa is a long-range, low-bandwidth networking protocol widely used in Internet of Things projects, sensor networks, and low-power, low-cost embedded systems. LoRa’s encoding scheme allows for extremely long-distance communications with low power usage and small, simple antennas. This combination of features has made it attractive to small satellite operators flying cubesats and LoRa is now …
Supervisors:
Sebastian Büttrich
Semester: Fall 2024
Tags: satellites, LoRa, cubesat, IoT, embedded, electronics
LoRa is a long-range, low-bandwidth networking protocol widely used in Internet of Things projects, sensor networks, and low-power, low-cost embedded systems. LoRa’s encoding scheme allows for extremely long-distance communications with low power usage and small, simple antennas. This combination of features has made it attractive to small satellite operators flying cubesats and LoRa is now …
Supervisors:
Sebastian Büttrich
Semester: Fall 2024
Tags: IoT, LoRa, LoRaWAN, satellites
Optical fiber is the backbone of the internet’s communication, e.g. in the form of submarine fiber cables. It can also be employed as a sensor device, by means of combined opto-acoustic methods such as Distributed Acoustic Sensing (DAS) or State of Polarisation (SoP) sensing. Fiber is capable of sensing all kinds of vibrational/acoustic events, from animal sounds and seismic activity to …
Supervisors:
Sebastian Büttrich
Semester: Fall 2024
Tags: fiber, acoustics, audio, machine learning, DAS, SOP
The DISCO-2 project is driven by students and aims to develop and deploy a 3-unit CubeSat into low Earth orbit. Its mission focuses on conducting Earth observations over Greenland and supporting various research objectives. The satellite has three cameras onboard: infrared, wide-angle, and standard (main camera). Due to the limitations of the imaging hardware and the challenging conditions on the …
Supervisors:
Yucheng Lu, Julian Priest
Semester: Fall 2024
Tags: Image enhancement, Image processing, Machine learning
The DISCO-2 project is driven by students and aims to develop and deploy a 3-unit CubeSat into low Earth orbit. Its mission focuses on conducting Earth observations over Greenland and supporting various research objectives. The satellite has three cameras onboard: infrared, wide-angle, and standard (main camera). Due to the limitations of the imaging hardware and the challenging conditions on the …
Supervisors:
Yucheng Lu, Julian Priest
Semester: Fall 2024
Tags: Image enhancement, Image processing, Machine learning
Concept Bottleneck Models [1] are designed to leverage high-level concepts. They revisit the classic idea of first predicting concepts that are provided at training time, and then using these concepts to predict the label. By construction, it is possible to intervene on these concept bottleneck models by editing their predicted concept values and propagating these changes to the final prediction. …
Supervisors:
Amelia Jiménez-Sánchez
Semester: Spring 2025
Tags: machine learning, data science, medical imaging
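The two-stage structure and the intervention mechanism described in this proposal can be sketched as follows. This is a minimal illustration with hand-picked weights, not the model from [1]: the function names, the logistic layers, and all numeric values are assumptions chosen for the example, and no training is shown.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict_concepts(features, concept_weights):
    """Stage 1: map input features to one probability per concept."""
    return [
        sigmoid(sum(w * f for w, f in zip(weights, features)))
        for weights in concept_weights
    ]


def predict_label(concepts, label_weights, bias=0.0):
    """Stage 2: predict the label from the concept values only (the bottleneck)."""
    return sigmoid(sum(w * c for w, c in zip(label_weights, concepts)) + bias)


def intervene(concepts, corrections):
    """Return a copy of the concepts with some values overwritten by an expert."""
    edited = list(concepts)
    for index, value in corrections.items():
        edited[index] = value
    return edited


# Hypothetical weights and input, for illustration only.
concept_weights = [[2.0, 0.0], [0.0, 1.0]]
label_weights = [1.0, 3.0]
features = [1.0, -2.0]

concepts = predict_concepts(features, concept_weights)
before = predict_label(concepts, label_weights, bias=-2.0)

# An expert states that concept 1 is actually present; the change propagates
# to the final prediction because the label depends only on the concepts.
corrected = intervene(concepts, {1: 1.0})
after = predict_label(corrected, label_weights, bias=-2.0)
```

Because the label head sees nothing but the concept vector, editing a single concept value is enough to change the final prediction, which is what makes interventions possible by construction.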