Data Preprocessing Pipelines

Supervisors: Pınar Tözün
Semester: Fall 2022
Tags: data preprocessing libraries, heterogeneous hardware, machine learning

It is common to process data (clean it, filter it, restructure it, extract metadata from it, etc.) before feeding it into a data analysis or machine learning pipeline. Many tools and libraries aid this process, each with different strengths and functionality (DALI, RAPIDS, HoloClean, DAPHNE, DuckDB, etc.). In this project, we would like to analyze the pros and cons of some of these tools across a variety of hardware platforms and use cases. The project is open in many ways and can be done as a regular semester project, a BSc thesis, or an MSc thesis. Based on the student's interests, we can adjust which tools, hardware devices, and use cases to focus on.