PROPOSAL

Min-Max Statistics in DuckDB


Supervisors: Martin Hentschel
Semester: Fall 2025
Tags: query optimizer

(MSc Research Project / MSc Thesis)

DuckDB stores data in a database file. The database file is split into partitions, and for each partition, DuckDB keeps statistics in the form of min-max summaries. Similar min-max summaries exist in Parquet files and in many other database systems, such as Snowflake. See Lecture 8 of last semester’s Introduction to Database Systems course for an overview of min-max summaries (I can send the slide deck if needed).
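
As a rough illustration of how a min-max summary lets a scan skip partitions, here is a minimal C++ sketch. The names MinMax, may_contain, and partitions_to_scan are hypothetical and chosen for this example only; they are not DuckDB's internal API, and integer values are assumed for simplicity.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical min-max summary of one partition (not DuckDB's actual type).
struct MinMax {
    int64_t min;
    int64_t max;
};

// True if the partition may hold a value in [lo, hi].
// False means no value can match, so the partition can be skipped.
inline bool may_contain(const MinMax &s, int64_t lo, int64_t hi) {
    return s.min <= hi && lo <= s.max;
}

// Returns the indices of the partitions that must be scanned
// for a predicate of the form lo <= x <= hi.
inline std::vector<std::size_t> partitions_to_scan(const std::vector<MinMax> &stats,
                                                   int64_t lo, int64_t hi) {
    std::vector<std::size_t> result;
    for (std::size_t i = 0; i < stats.size(); ++i) {
        if (may_contain(stats[i], lo, hi)) {
            result.push_back(i);
        }
    }
    return result;
}
```

With three partitions summarized as [0, 99], [100, 199], and [200, 299], a predicate 150 <= x <= 210 only requires scanning the second and third partition.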

In this project, the goal is to extend DuckDB's use of min-max statistics by storing and using several min-max ranges (clusters) per partition instead of a single one. This helps when the data has “holes”, that is, gaps between a partition's minimum and maximum in which no values occur. A single min-max summary covers these holes, so a query whose predicate falls entirely into a hole still has to scan the partition. Clusters of min-max statistics exclude the holes, so such partitions can be skipped and the query answered faster. This project is very implementation-heavy and is therefore best suited for Computer Science students with knowledge of C++.
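
One possible shape of such a clustered statistic is sketched below, under the assumption that a partition keeps a small list of disjoint min-max ranges. The names Range, ClusteredMinMax, and may_contain_clustered are hypothetical; how clusters are actually chosen, stored, and used inside DuckDB is the subject of the project.

```cpp
#include <cstdint>
#include <vector>

// One min-max range covering a cluster of values (hypothetical type).
struct Range {
    int64_t min;
    int64_t max;
};

// Hypothetical per-partition statistic: several disjoint min-max ranges.
struct ClusteredMinMax {
    std::vector<Range> clusters;
};

// True if any cluster overlaps [lo, hi]. If the query range falls
// entirely into a hole between clusters, the partition can be skipped.
inline bool may_contain_clustered(const ClusteredMinMax &s, int64_t lo, int64_t hi) {
    for (const Range &r : s.clusters) {
        if (r.min <= hi && lo <= r.max) {
            return true;
        }
    }
    return false;
}
```

For a partition whose values lie in [1, 10] and [1000, 1010], a single min-max summary of [1, 1010] cannot rule out the predicate 500 <= x <= 600, whereas the two clusters above can.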