PROPOSAL

Data preparation for learning-based query optimization


Supervisors: Xiao Li, Zoi Kaoudi
Semester: Spring 2026
Tags: data preparation, query optimization, machine learning, database

Query optimization lies at the core of database systems, and learning-based query optimization is attracting growing attention because AI techniques are expected to open new opportunities for improving it. Data preparation is generally believed to play an important role in machine learning tasks, and this likely applies to learning-based query optimization as well. However, existing works on learning-based query optimization differ widely in how they prepare data: what kinds of queries and plans they use, what features they select or convert, what labels they use, and so on. This raises the question: what is the most effective way to prepare data for learned query optimizers so that they achieve the best possible benefit? Answering it requires exploring which queries and plans are more representative, which features are most helpful (or whether this must be decided case by case), and which labels to collect, considering both collection efficiency and prediction accuracy. It is also desirable to provide a proper interpretation of the data preparation process.
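To make the preparation choices above concrete (which queries, which features, which labels), the following is a minimal Python sketch of how a single training sample for a learned cardinality estimator might be built. Everything in it is an illustrative assumption rather than a method prescribed by this project: the schema (`TABLES`, `COLUMN_RANGES`), the helper names `featurize` and `make_label`, the one-hot plus normalized-range encoding, and the log-scaled label are all hypothetical and stand for just one of many possible designs.

```python
import math

# Hypothetical schema: table names and value ranges of numeric columns.
TABLES = ["orders", "customers", "items"]
COLUMN_RANGES = {"orders.year": (2000, 2020)}


def featurize(tables, predicates):
    """Encode a query as a flat feature vector.

    tables:     list of table names referenced by the query
    predicates: list of (column, operator, value) range/equality filters
    """
    # One-hot encoding of the tables that appear in the query.
    table_vec = [1.0 if t in tables else 0.0 for t in TABLES]

    # Per column: normalized [low, high] bounds; [0, 1] means "no filter".
    pred_vec = []
    for col in sorted(COLUMN_RANGES):
        lo, hi = COLUMN_RANGES[col]
        low, high = 0.0, 1.0
        for c, op, value in predicates:
            if c != col:
                continue
            norm = min(max((value - lo) / (hi - lo), 0.0), 1.0)
            if op in (">", ">="):
                low = max(low, norm)
            elif op in ("<", "<="):
                high = min(high, norm)
            elif op == "=":
                low = high = norm
        pred_vec += [low, high]
    return table_vec + pred_vec


def make_label(cardinality):
    """Log-scale the true cardinality to compress its wide value range."""
    return math.log1p(cardinality)


# One (features, label) training sample for a hypothetical query.
features = featurize(["orders"], [("orders.year", ">", 2010)])
label = make_label(1500)
```

Each choice in this sketch is a point where existing works may differ, for example encoding predicates as ranges versus operator/value pairs, or using raw versus log-scaled cardinalities as labels.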

The expected outcome of this project is a set of more effective and efficient ways of preparing data for learned query optimization tasks (e.g., cardinality or cost estimation), ideally also more explainable than those used in existing works.

Prerequisites: programming skills in Python, good knowledge of machine learning, and basic knowledge of database system internals (query optimization).