PROPOSAL

Deep learning training jobs can run for hours or days, causing long queue times and poor quality of service (QoS) on shared clusters. Because schedulers cannot accurately predict training durations, they often have to wait for jobs to finish or time out, which worsens the problem. Frameworks like TensorFlow and PyTorch support model checkpointing, but frequent checkpoints can slow training, while infrequent ones reduce …
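
To make the trade-off concrete, below is a minimal sketch of periodic checkpointing in PyTorch. The model, data, and save interval are placeholder assumptions for illustration and are not part of the proposal itself.

    # Minimal sketch of periodic checkpointing in PyTorch (illustrative only;
    # the model, batch data, and checkpoint interval are assumed placeholders).
    import torch
    import torch.nn as nn

    model = nn.Linear(10, 1)                                  # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    checkpoint_every = 100                                    # assumed interval (steps)

    for step in range(1000):
        x = torch.randn(32, 10)                               # placeholder batch
        y = torch.randn(32, 1)
        loss = nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if step % checkpoint_every == 0:
            # Saving more often shortens recovery after preemption or failure,
            # but each save adds I/O overhead to training: the trade-off this
            # proposal targets.
            torch.save({"step": step,
                        "model_state": model.state_dict(),
                        "optimizer_state": optimizer.state_dict()},
                       f"checkpoint_{step}.pt")

Choosing checkpoint_every is exactly the kind of policy decision a scheduler-aware checkpointing mechanism would need to make dynamically rather than leaving it fixed.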
Supervisors: Pınar Tözün, Ehsan Yousefzadeh-Asl-Miandoab
Semester: Fall 2025
Tags: machine learning systems, checkpointing, scheduling, resource management