PROPOSAL

Good or Bad: LLM-Generated Datasets

Supervisors: Martin Hentschel
Semester: Fall 2025
Tags: training data, machine learning, LLMs

(MSc Research Project / MSc Thesis)

The goal of this project is to investigate how AI systems, in particular large language models (LLMs), generate datasets. Research questions include: Where does the generated data come from? Can its sources be traced, for example to material available on the internet? What biases does the generated data contain? And how much of it is simply wrong?

Generated datasets are used in practice in many fields, including journalism (e.g., to create diagrams for an article) and software engineering (e.g., to produce data for testing software). Incorrect, low-quality, or biased data could therefore affect journalism, software engineering, and many other areas that rely on such data.

This project is suited for Data Science and Software Development students, but Computer Science students are also welcome.