PROPOSAL
Checkpointing during Deep Learning Training
Deep learning training can run for hours or days, leading to long queue times and poor quality of service (QoS) on shared clusters. Because schedulers cannot accurately predict training durations, they typically have to wait for a job either to finish or to hit its time limit, which makes the problem worse.
Frameworks such as TensorFlow and PyTorch support model checkpointing, but checkpointing too frequently slows training, while checkpointing too infrequently means more work is lost when a job is preempted or fails. Automating the choice of when to checkpoint could improve scheduler decisions and fairness.
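To make the trade-off concrete, the sketch below shows periodic checkpointing in a PyTorch training loop; the toy model, data, and the CHECKPOINT_EVERY interval are illustrative placeholders, not part of the proposed module.

# Minimal sketch of periodic checkpointing in PyTorch (placeholder model/data).
import torch
import torch.nn as nn

model = nn.Linear(128, 10)                      # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
CHECKPOINT_EVERY = 100                          # assumed interval, in steps

for step in range(1000):                        # placeholder training loop
    x = torch.randn(32, 128)
    loss = model(x).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % CHECKPOINT_EVERY == 0:
        # Saving model + optimizer state lets a preempted job resume here,
        # but each save adds serialization and I/O time to the training loop.
        torch.save(
            {"step": step,
             "model_state": model.state_dict(),
             "optimizer_state": optimizer.state_dict()},
            f"checkpoint_{step}.pt",
        )

Choosing CHECKPOINT_EVERY is exactly the frequency question the project addresses: smaller values mean less lost work on preemption but more time spent saving.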
This project aims to (1) benchmark checkpointing overhead in TensorFlow and PyTorch, (2) identify optimal checkpointing frequencies, and (3) develop an automatic checkpointing module.
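As a starting point for aim (1), a minimal sketch of measuring per-checkpoint overhead in PyTorch follows; the toy model, output path, and repetition count are assumptions for illustration only.

# Minimal sketch: time torch.save to estimate per-checkpoint overhead.
import time
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 1024), nn.Linear(1024, 1024))  # toy model
save_times = []

for i in range(10):
    start = time.perf_counter()
    torch.save(model.state_dict(), f"/tmp/ckpt_{i}.pt")  # placeholder path
    save_times.append(time.perf_counter() - start)

print(f"mean checkpoint time: {sum(save_times) / len(save_times):.3f}s")
# Comparing this mean against the average training-step time gives the
# relative overhead of a given checkpointing frequency.

A comparable measurement in TensorFlow (e.g. via its checkpointing API) would allow the benchmarks in aim (1) to cover both frameworks.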
Suitable for a BSc or MSc thesis. Ideal for students interested in machine learning systems and efficiency. Scope can be adjusted based on thesis level and group size.