PROPOSAL

Training data generation for data management systems

Supervisors: Zoi Kaoudi
Semester: Fall 2025
Tags: machine learning, training data, query optimizer

Query optimization is crucial for any data management system to achieve good performance. Recent advancements in AI have led academia and industry to investigate learning-based techniques for query optimization. In particular, many works propose replacing the cost model used during plan enumeration with a machine learning model that estimates the runtime of a plan. However, building such a model requires a large amount of training data. In this context, training data consist of query plans and their runtimes. Collecting thousands of such data points takes a long time: days or even months when the input data size is large. For this reason, we have built DataFarm (https://github.com/agora-ecosystem/data-farm), an efficient data-driven training data generator that outputs high-quality training data (query plans with their runtimes) in a fraction of the time required to collect all labels manually. More details here.
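To make the notion of training data concrete, the following minimal Python sketch shows one labeled data point (a featurized plan plus its measured runtime) and how a learned model could stand in for a cost model when choosing among candidate plans. All names here (LabeledPlan, fit_nearest_neighbor, choose_plan) and the feature encoding are hypothetical and are not part of DataFarm; the 1-nearest-neighbor predictor is only a simple stand-in for whatever learned model is actually used.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Features = Tuple[float, ...]


@dataclass(frozen=True)
class LabeledPlan:
    """One training point: a featurized query plan and its measured runtime."""
    features: Features   # hypothetical encoding, e.g. (number of joins, input size)
    runtime_s: float     # label, obtained by actually executing the plan


def fit_nearest_neighbor(samples: List[LabeledPlan]) -> Callable[[Features], float]:
    """Return a predictor that reuses the runtime of the closest training plan.

    This is a deliberately simple stand-in (an assumption) for a learned
    runtime model; it is not the model used by any particular system.
    """
    if not samples:
        raise ValueError("at least one labeled plan is required")

    def predict(features: Features) -> float:
        # Pick the training plan with the smallest Euclidean distance.
        nearest = min(samples, key=lambda s: math.dist(s.features, features))
        return nearest.runtime_s

    return predict


def choose_plan(candidates: Sequence[Features],
                predict: Callable[[Features], float]) -> Features:
    """Select the candidate plan with the lowest predicted runtime.

    This plays the role the cost model has during plan enumeration.
    """
    return min(candidates, key=predict)
```

The quality of such a predictor depends directly on how many labeled plans are available and how well they cover the plan space, which is why generating the labels efficiently matters.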

Still, DataFarm currently has several limitations. First, it supports only Flink jobs as query plans. Second, the training data generation must run on the same hardware as the query optimizer. This is not always feasible, especially when the data system is used in production. Third, DataFarm outputs only plan runtimes as labels of the training data. Other labels are also useful, e.g., cardinality estimates.

The project or thesis will focus on tackling one or more of the above limitations.

Prerequisites: programming skills in Python or Java and (preferably) SQL; a basic foundation in machine learning; and (preferably) knowledge of big data systems (e.g., Apache Flink, Apache Spark).