Project Proposals -- DASYA

PROPOSAL

Sensor nodes for stratospheric balloon missions

Stratospheric balloon missions (up to a maximum height of 50 km) are both a “warm-up” exercise in preparation for our cubesat satellite missions and interesting in themselves, increasingly so because of the heightened attention to stratospheric pollution caused by the uncontrolled and rapidly accelerating de-orbiting of low Earth orbit satellites into the atmosphere. While there is a …
Supervisors: Sebastian Büttrich
Semester: Fall 2025
Tags: network, IoT, LoRa, LoRaWAN, satellites

PROPOSAL

Extreme networking

This is not a single project, but rather a larger cluster of potential projects in the field of what could be summarized as extreme networking. The networks we are interested in are typically wireless, and can be extreme in different senses of the word: distance - hundreds of kilometers terrestrial, 10,000s of km to satellite; latency - sub-ms latencies; autonomy - off-grid; quality - extreme remote …
Supervisors: Sebastian Büttrich
Semester: Fall 2025
Tags: network, IoT, LoRa, LoRaWAN, satellites

PROPOSAL

Cubesat LoRa module

LoRa is a long-range, low-bandwidth networking protocol widely used in Internet of Things projects, sensor networks, and low-power, low-cost embedded systems. LoRa’s encoding scheme allows for extremely long-distance communication with low power usage and small, simple antennas. This combination of features has made it attractive to small satellite operators flying cubesats and LoRa is now …
Supervisors: Sebastian Büttrich
Semester: Fall 2025
Tags: satellites, LoRa, cubesat, IoT, embedded, electronics

PROPOSAL

Innovative Satellite LoRa use cases

LoRa is a long-range, low-bandwidth networking protocol widely used in Internet of Things projects, sensor networks, and low-power, low-cost embedded systems. LoRa’s encoding scheme allows for extremely long-distance communication with low power usage and small, simple antennas. This combination of features has made it attractive to small satellite operators flying cubesats and LoRa is now …
Supervisors: Sebastian Büttrich
Semester: Fall 2025
Tags: IoT, LoRa, LoRaWAN, satellites

PROPOSAL

Machine learning on optical fiber sensor data

Optical fiber is the backbone of the internet’s communication, e.g. in the form of submarine fiber cables. It can also be employed as a sensor device, by means of combined opto-acoustic methods such as Distributed Acoustic Sensing (DAS) or State of Polarisation (SoP) sensing. Fiber is capable of sensing all kinds of vibrational/acoustic events, from animal sounds through seismic activity to …
Supervisors: Sebastian Büttrich
Semester: Fall 2025
Tags: fiber, acoustics, audio, machine learning, DAS, SOP

PROPOSAL

Danish Student Cubesat

The Danish Student Cubesat Program is an inter-university collaboration that will launch 3 cubesats into Low Earth Orbit over the next 4 years. The satellites will be designed, built, programmed and operated by students, and the project offers an opportunity for Master’s students to take part in a live satellite project. ITU is partnering with Aarhus University on DISCOSAT2 which will be an …
Supervisors: Sebastian Büttrich, Julian Priest
Semester: Fall 2021
Tags: Satellite, Cubesat, Image processing, Machine Learning, edge, constrained computing

PROPOSAL

Evaluating the Impact of Collocating Deep Learning Training Tasks on Jetson Orin Nano GPUs

This project investigates how running multiple deep learning training tasks simultaneously (collocation) affects performance on resource-constrained edge devices, specifically the NVIDIA Jetson Orin Nano. Students will deploy and benchmark various models (e.g., CNNs, Transformers) in isolated vs. different collocated scenarios, measure metrics such as GPU utilization, memory usage, training time, …
Supervisors: Pınar Tözün, Ehsan Yousefzadeh-Asl-Miandoab
Semester: Fall 2025
Tags: machine learning systems, GPU Utilization, resource management, resource interference

PROPOSAL

Alternative IO backends for Database Systems and SSDs

A recent MSc thesis project at ITU integrated xNVMe into DuckDB. DuckDB, like many popular database systems, relies on a synchronous IO backend and the POSIX filesystem interfaces for their ease of use. xNVMe, in contrast, provides a unified interface to alternative and more performant IO backends such as io_uring and SPDK. Thus, integration of xNVMe into DuckDB offers interesting avenues for data …
Supervisors: Pınar Tözün
Semester: Fall 2025
Tags: SSDs, data management systems, modern storage

PROPOSAL

Benchmarking Edge Devices for Data-Intensive Applications

The work on running data-intensive applications on very powerful, expensive, and power-hungry server hardware is very popular thanks to the growing size of data centers and high-performance computing (HPC) platforms. However, with the rise of new generation internet of things (IoT) applications, the lower-power and lower-budget hardware devices that specifically target IoT, the edge platforms, …
Supervisors: Pınar Tözün
Semester: Fall 2025
Tags: edge, benchmarking, data-intensive applications, resource-constrained hardware

PROPOSAL

Data Attribution on Progressive Datasets for Deep Learning

Deep convolutional networks are able to learn representations of images, scoring well in tasks such as image classification and object detection. During model training, these networks can process different input sizes without requiring changes to their architecture. In this project, we would like to investigate the effects that changing input sizes has on these kinds of models. We …
Supervisors: Pınar Tözün, Ties Robroek
Semester: Fall 2025
Tags: data attribution, deep learning, machine learning, resource efficiency

PROPOSAL

Efficient Data Selection Methods for Machine Learning

Today’s foundation models are trained on vast amounts of data. The quality and size of this data have a huge impact on the accuracy of these models. Selecting the right amount and variety of data for a given task, however, is a resource-intensive process. In this project, which is part of a larger collaboration, we would like to expand our investigation of state-of-the-art data selection mechanisms …
Supervisors: Pınar Tözün, Ties Robroek
Semester: Fall 2025
Tags: data selection, deep learning, machine learning, resource efficiency

PROPOSAL

Framework for Systematic Performance Experiments for Machine Learning

Observing how well machine learning systems utilize hardware resources is a crucial preliminary step to improve system performance and reduce hardware waste. To do such observations, one has to collect a lot of monitoring data on hardware behavior through experiments. In our group, we have recently built a framework to aid the management of such monitoring data efficiently, called Resource-Aware …
Supervisors: Pınar Tözün, Ties Robroek
Semester: Fall 2025
Tags: benchmarking, data management, data visualization

PROPOSAL

Going Beyond Memory with GPU-based Data Analytics

The rise of hardware accelerators to meet the demand of AI workloads has also led to a variety of novel methods to leverage GPUs for traditional data analytics workloads. A key concern for any data-intensive system using GPUs is the efficiency of moving the data to the accelerator. In this project, we will investigate ways to improve data movement to GPUs by focusing on the steps of the data path …
Supervisors: Pınar Tözün
Semester: Fall 2025
Tags: SSDs, GPU-centric IO, data analytics, modern storage

PROPOSAL

Resource Management on Tiny Hardware

Today, many data sources are small, low-powered and hardware-constrained devices such as mobile phones, wearables, or self-driving smart platforms. Edge computing is a broad term that refers to computations performed on such edge devices. It is becoming increasingly important to enable techniques that get more value out of data at the edge rather than always sending the data to a remote and more …
Supervisors: Pınar Tözün, Robert Bayer
Semester: Fall 2025
Tags: resource-constrained hardware, data management, resource management, tinyML

PROPOSAL

BLOX for Deep Learning Task Scheduling with GPU Collocation

Workload collocation has been shown to be an effective method to reduce the hardware requirements for certain deep learning (DL) training tasks. On the other hand, there haven’t been many robust open-source implementations of schedulers that incorporate workload collocation on GPUs for DL. BLOX is a framework that aims at standardizing the way we implement deep learning schedulers. In this …
Supervisors: Pınar Tözün, Ehsan Yousefzadeh-Asl-Miandoab
Semester: Fall 2025
Tags: machine learning systems, scheduling, resource management, workload collocation

PROPOSAL

Checkpointing during Deep Learning Training

Deep learning training can run for hours or days, causing long queue times and poor quality of service (QoS) on shared clusters. Since schedulers can’t accurately predict training durations, they often wait for jobs to finish or time out, worsening the issue. Frameworks like TensorFlow and PyTorch support model checkpointing, but frequent checkpoints can slow training, while infrequent ones reduce …
Supervisors: Pınar Tözün, Ehsan Yousefzadeh-Asl-Miandoab
Semester: Fall 2025
Tags: machine learning systems, checkpointing, scheduling, resource management

PROPOSAL

GPU Memory Dataset with a focus on Transformer-Based Models

This project focuses on extending an existing dataset for predicting GPU memory requirements during deep learning training by incorporating transformer-based models such as BERT, GPT, and their variants. The student will study the architecture of these models and develop training scripts to run them under controlled conditions. During training, key GPU metrics—including memory usage, utilization, …
Supervisors: Pınar Tözün, Ehsan Yousefzadeh-Asl-Miandoab
Semester: Fall 2025
Tags: machine learning systems, GPU Memory Requirement, GPU Utilization, resource management

PROPOSAL

Predicting GPU utilization for Deep learning training

A GPU offers massive computational power and parallelism through its Streaming Multiprocessors (SMs). Efficient GPU utilization is critical for maximizing performance and optimizing compute resource usage, and is measured using various metrics such as SMACT (SM Activity), SMOCC (SM Occupancy), and DRAMA (DRAM Active). These metrics provide insight into how effectively the GPU’s SMs and …
Supervisors: Pınar Tözün, Ehsan Yousefzadeh-Asl-Miandoab
Semester: Fall 2025
Tags: machine learning systems, GPU Utilization, resource management, resource interference

PROPOSAL

Character Maps in Query Processing

(MSc Research Project / MSc Thesis) Character maps in database systems are specialized data structures optimized for efficient string handling. Similar to Bloom filters, they allow quick checks for the presence of characters or substrings without full string comparisons. This makes them useful for accelerating string-related queries and improving overall search performance in databases. The goal …
Supervisors: Martin Hentschel
Semester: Fall 2025
Tags: query optimizer
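
The teaser above does not define the data structure, so the following is only a minimal sketch of one plausible reading: a per-string bitmask recording which characters occur, used as a Bloom-filter-like pre-check before a full substring comparison. The names `char_map` and `may_contain` and the 64-bucket size are assumptions made for illustration, not the project's design.

```python
def char_map(s: str, bits: int = 64) -> int:
    """Build a bitmask with one bit set per character bucket present in s."""
    mask = 0
    for ch in s:
        mask |= 1 << (ord(ch) % bits)  # several characters may share a bucket
    return mask


def may_contain(row_map: int, pattern: str, bits: int = 64) -> bool:
    """Return False only if the row cannot contain the pattern.

    True means "possibly": bucket collisions and character order are not
    checked here, so a full comparison is still needed afterwards.
    """
    need = char_map(pattern, bits)
    return (row_map & need) == need


rows = ["duckdb", "parquet", "snowflake"]
maps = [char_map(r) for r in rows]  # computed once, e.g. at load time

# Cheap pre-filter first, exact substring check only on the survivors.
candidates = [r for r, m in zip(rows, maps) if may_contain(m, "ake")]
matches = [r for r in candidates if "ake" in r]
```

As with a Bloom filter, the pre-check can produce false positives but never false negatives, which is why the exact comparison is kept as a second step.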

PROPOSAL

Database-Driven JS Game Framework

(BSc Thesis) The idea of this project is to write a game framework in JavaScript (e.g., a physics engine or state machine that keeps track of the game state) that can be re-run deterministically using databases. That is, in addition to being a standalone framework, the JavaScript framework must run as a user-defined function (UDF) in a relational database where it can be executed on thousands of …
Supervisors: Martin Hentschel
Semester: Fall 2025
Tags: JavaScript

PROPOSAL

Good or Bad: LLM-Generated Datasets

(MSc Research Project / MSc Thesis) The goal of this project is to research how AI and large language models generate datasets. Research questions include: Where does the generated data come from? Are sources available on the internet or can they be found? What biases exist in the generated data? And how much of the data is simply wrong? Generated datasets are used in many fields in practice, …
Supervisors: Martin Hentschel
Semester: Fall 2025
Tags: training data, machine learning, LLMs

PROPOSAL

Min-Max Statistics in DuckDB

(MSc Research Project / MSc Thesis) DuckDB stores data in a database file. The database file is split into partitions, and for each partition, DuckDB keeps statistics in the form of min-max summaries. Similar min-max summaries exist in Parquet files and many other database systems, such as Snowflake. Check out Lecture 8 of last semester’s Introduction to Database Systems course for a summary …
Supervisors: Martin Hentschel
Semester: Fall 2025
Tags: query optimizer
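
To illustrate the general idea of min-max summaries, here is a small sketch of partition pruning for a range filter. It is not DuckDB's implementation; the `Partition` class, the `build` and `scan_range` helpers, and the partition size of 10 are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    rows: list[int]
    lo: int  # minimum value stored in this partition
    hi: int  # maximum value stored in this partition


def build(values: list[int], size: int) -> list[Partition]:
    """Split values into fixed-size partitions and record min/max for each."""
    parts = []
    for i in range(0, len(values), size):
        chunk = values[i:i + size]
        parts.append(Partition(chunk, min(chunk), max(chunk)))
    return parts


def scan_range(parts: list[Partition], low: int, high: int):
    """Return values in [low, high] and the number of partitions read."""
    out, scanned = [], 0
    for p in parts:
        if p.hi < low or p.lo > high:
            continue  # summary proves no row can match: skip the partition
        scanned += 1
        out.extend(v for v in p.rows if low <= v <= high)
    return out, scanned


parts = build(list(range(100)), size=10)
result, scanned = scan_range(parts, 42, 57)
```

In this sorted example only 2 of the 10 partitions are read. How much can be skipped depends on how the data is laid out: on unsorted data the min-max ranges overlap more and fewer partitions are pruned.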

PROPOSAL

Role-Based Access Control in Data Lakes

(MSc Research Project / MSc Thesis) Role-based access control (RBAC) and data lakes do not seem to go together very well. RBAC controls who can access specific information. Data lakes allow all users to see all information. Is encryption the only way to bridge these two worlds? In this computer science-focused MSc research project and/or master’s thesis, you will investigate existing …
Supervisors: Martin Hentschel
Semester: Fall 2025
Tags: security

PROPOSAL

Learning-to-rank methods for query optimization

Query optimization is crucial for any data management system to achieve good performance. Recent advancements in AI have led academia and industry to investigate learning-based techniques in query optimization. In particular, many works propose replacing the cost model used during plan enumeration with a machine learning model (typically a regression model) that estimates the runtime of a query …
Supervisors: Zoi Kaoudi
Semester: Fall 2025
Tags: machine learning, database, query optimization, ranking

PROPOSAL

Big data processing with Apache Wayang

Are you interested in working with a big data open source project? You are welcome to conduct your thesis/project in Apache Wayang. Apache Wayang is the first cross-platform framework that allows users to specify their task/query in a system-agnostic manner; Wayang then determines which system(s) are best suited to execute this task, with the goal of optimizing performance. For a general overview …
Supervisors: Zoi Kaoudi
Semester: Fall 2025
Tags: big data, database, cross-platform data processing, open source, Apache

PROPOSAL

Combine Knowledge Graph Embeddings and Reasoning

Knowledge graphs (KGs) are extensively used in many application domains, such as search engines, product recommendation, and bioinformatics. Knowledge graph completion (a.k.a. link prediction), i.e., the task of inferring missing information from knowledge graphs, is a widely used task in the above applications. This project will investigate how to loosely couple the data-driven power of knowledge …
Supervisors: Zoi Kaoudi
Semester: Fall 2025
Tags: knowledge graph, LLMs, reasoning

PROPOSAL

MCP server for Apache Wayang

Are you interested in working with a big data open source project and AI? You are welcome to conduct your thesis/project in the context of Apache Wayang. Apache Wayang is the first cross-platform framework that allows users to specify their task/query in a system-agnostic manner; Wayang then determines which system(s) are best suited to execute this task, with the goal of optimizing performance. For …
Supervisors: Zoi Kaoudi
Semester: Fall 2025
Tags: big data, AI, LLMs, cross-platform data processing, open source, Apache

PROPOSAL

Reduce energy consumption with Apache Wayang

Are you interested in working with a big data open source project and helping the environment? You are welcome to conduct your thesis/project in Apache Wayang. Apache Wayang is the first cross-platform framework that allows users to specify their task/query in a system-agnostic manner; Wayang then determines which system(s) are best suited to execute this task, with the goal of optimizing performance. …
Supervisors: Zoi Kaoudi
Semester: Fall 2025
Tags: big data, database, cross-platform data processing, open source, Apache

PROPOSAL

Training data generation for data management systems

Query optimization is crucial for any data management system to achieve good performance. Recent advancements in AI have led academia and industry to investigate learning-based techniques in query optimization. In particular, many works propose replacing the cost model used during plan enumeration with a machine learning model that estimates the runtime of a plan. However, to build such a model …
Supervisors: Zoi Kaoudi
Semester: Fall 2025
Tags: machine learning, training data, query optimizer

PROPOSAL

Data representativity, similarity, and diversity

Machine learning methods are often evaluated on benchmark datasets, in computer vision, medical imaging, NLP and other fields. In such evaluations, researchers often describe the data as being: representative, for example based on the distribution of ages of the patients mirroring the world population; similar, for example because both datasets contain pictures of animals; diverse, for example …
Supervisors: Veronika Cheplygina
Semester: Spring 2025
Tags: machine learning, medical imaging, data analysis, meta-research

PROPOSAL

15-minute cities visualisation

The idea behind “15-minute cities” is that within a short walk or bike ride people should have access to all necessary facilities that constitute the essence of urban living, such as parks, shops, cafes, schools, hospitals. Initiatives to transform cities according to this paradigm are currently being implemented across the world, in an attempt to make urban spaces more liveable, …
Supervisor: Maria Astefanoaei
Semester: Fall 2021
Tags: spatial data analysis, visualisation, Python, OSM data

PROPOSAL

Algorithms for data-aware cycling network expansion

As a response to increased traffic congestion and the need to reduce carbon emissions, cities consider ways to modernise, build and extend transit systems. Transit network design solutions can benefit from analysing the large amount of crowd-sourced location data available, which provides valuable insights into population mobility needs. Designing efficient metro lines, bicycle paths, or bus …
Supervisor: Maria Astefanoaei
Semester: Fall 2021
Tags: spatial data analysis, network design, Python, OSM data

PROPOSAL

Graph summaries of accessibility maps

The idea behind “15-minute cities” is that within a short walk or bike ride people should have access to all necessary facilities that constitute the essence of urban living, such as parks, shops, cafes, schools, hospitals. Initiatives to transform cities according to this paradigm are currently being implemented across the world, in an attempt to make urban spaces more liveable, …
Supervisor: Maria Astefanoaei
Semester: Fall 2021
Tags: spatial data analysis, graph summaries, Python, OSM data

PROPOSAL

Music genre embeddings

Musical genres are inherently ambiguous and difficult to define. Even more so is the task of establishing how genres relate to one another. Yet, genre is perhaps the most common and effective way of describing musical experience. The number of possible genre classifications (e.g. Spotify has over 4000 genre tags, LastFM over 500,000 tags) has made the idea of manually creating music taxonomies …
Supervisor: Maria Astefanoaei
Semester: Fall 2021
Tags: scalable algorithms, hyperbolic embeddings, Python, Spotify data

Supervisor: Sebastian Büttrich

PROPOSAL

Sensor nodes for stratospheric balloon missions

PROPOSAL

Extreme networking

PROPOSAL

Cubesat LoRa module

PROPOSAL

Innovative Satellite LoRa use cases

PROPOSAL

Machine learning on optical fiber sensor data

PROPOSAL

Danish Student Cubesat

Supervisor: Pınar Tözün

PROPOSAL

Evaluating the Impact of Collocating Deep Learning Training Tasks on Jetson Orin Nano GPUs

PROPOSAL

Alternative IO backends for Database Systems and SSDs

PROPOSAL

Benchmarking Edge Devices for Data-Intensive Applications

PROPOSAL

Data Attribution on Progressive Datasets for Deep Learning

PROPOSAL

Efficient Data Selection Methods for Machine Learning

PROPOSAL

Framework for Systematic Performance Experiments for Machine Learning

PROPOSAL

Going Beyond Memory with GPU-based Data Analytics

PROPOSAL

Resource Management on Tiny Hardware

PROPOSAL

BLOX for Deep Learning Task Scheduling with GPU Collocation

PROPOSAL

Checkpointing during Deep Learning Training

PROPOSAL

GPU Memory Dataset with a focus on Transformer-Based Models

PROPOSAL

Predicting GPU utilization for Deep learning training

Supervisor: Ehsan Yousefzadeh-Asl-Miandoab

PROPOSAL

Evaluating the Impact of Collocating Deep Learning Training Tasks on Jetson Orin Nano GPUs

PROPOSAL

BLOX for Deep Learning Task Scheduling with GPU Collocation

PROPOSAL

Checkpointing during Deep Learning Training

PROPOSAL

GPU Memory Dataset with a focus on Transformer-Based Models

PROPOSAL

Predicting GPU utilization for Deep learning training

Supervisor: Ties Robroek

PROPOSAL

Data Attribution on Progressive Datasets for Deep Learning

PROPOSAL

Efficient Data Selection Methods for Machine Learning

PROPOSAL

Framework for Systematic Performance Experiments for Machine Learning

Supervisor: Robert Bayer

PROPOSAL

Resource Management on Tiny Hardware

Supervisor: Martin Hentschel

PROPOSAL

Character Maps in Query Processing

PROPOSAL

Database-Driven JS Game Framework

PROPOSAL

Good or Bad: LLM-Generated Datasets

PROPOSAL

Min-Max Statistics in DuckDB

PROPOSAL

Role-Based Access Control in Data Lakes

Supervisor: Zoi Kaoudi

PROPOSAL

Learning-to-rank methods for query optimization

PROPOSAL

Big data processing with Apache Wayang

PROPOSAL

Combine Knowledge Graph Embeddings and Reasoning

PROPOSAL

MCP server for Apache Wayang

PROPOSAL