PROPOSAL

Checkpointing during Deep Learning Training


Supervisors: Pınar Tözün, Ehsan Yousefzadeh-Asl-Miandoab
Semester: Fall 2025
Tags: machine learning systems, checkpointing, scheduling, resource management

Deep learning training can run for hours or days, leading to long queue times and poor quality of service (QoS) on shared clusters. Since schedulers cannot accurately predict training durations, they often have to wait for jobs to finish or time out, which worsens the problem.

Frameworks like TensorFlow and PyTorch support model checkpointing, but the checkpointing frequency involves a trade-off: frequent checkpoints add I/O overhead that slows training, while infrequent ones risk losing more work when a job is interrupted. Automating the choice of checkpointing frequency could improve scheduler decisions and fairness.
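As a rough illustration (not part of the proposal itself), a minimal PyTorch checkpointing routine might look like the sketch below; the function names and the checkpoint contents are illustrative assumptions.

    import torch

    def save_checkpoint(model, optimizer, epoch, path):
        # Persist everything needed to resume training where it stopped.
        torch.save({
            "epoch": epoch,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        }, path)

    def load_checkpoint(model, optimizer, path):
        # Restore model and optimizer state; return the epoch to resume from.
        ckpt = torch.load(path)
        model.load_state_dict(ckpt["model_state"])
        optimizer.load_state_dict(ckpt["optimizer_state"])
        return ckpt["epoch"] + 1

Saving the optimizer state alongside the model weights matters here: resuming from weights alone would reset momentum and learning-rate schedules and change the training trajectory.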

This project aims to (1) benchmark checkpointing overhead in TensorFlow and PyTorch, (2) identify optimal checkpointing frequencies, and (3) develop an automatic checkpointing module.
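To give a flavor of aim (1), checkpointing overhead can be measured by timing the save calls against the surrounding training steps. The sketch below is a toy benchmark under assumed settings (a small linear model, synthetic data, and a candidate interval CHECKPOINT_EVERY); a real benchmark would use the actual training workloads.

    import time
    import torch
    import torch.nn as nn

    # Toy model, optimizer, and data; placeholders for a real workload.
    model = nn.Linear(1024, 1024)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    data = torch.randn(64, 1024)
    loss_fn = nn.MSELoss()

    CHECKPOINT_EVERY = 10  # candidate frequency to evaluate (assumption)
    step_time = ckpt_time = 0.0

    for step in range(100):
        t0 = time.perf_counter()
        optimizer.zero_grad()
        loss = loss_fn(model(data), data)
        loss.backward()
        optimizer.step()
        step_time += time.perf_counter() - t0

        if step % CHECKPOINT_EVERY == 0:
            # Time only the checkpoint itself to isolate its overhead.
            t0 = time.perf_counter()
            torch.save({"model": model.state_dict(),
                        "optimizer": optimizer.state_dict()}, "ckpt.pt")
            ckpt_time += time.perf_counter() - t0

    print(f"compute: {step_time:.2f}s  checkpointing: {ckpt_time:.2f}s "
          f"({100 * ckpt_time / (step_time + ckpt_time):.1f}% overhead)")

Sweeping CHECKPOINT_EVERY over a range of values in such a harness is one way to approach aim (2), trading measured overhead against the work lost on interruption.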

This project is suitable for a BSc or MSc thesis and is ideal for students interested in machine learning systems and efficiency. The scope can be adjusted to the thesis level and group size.