Analysis of Checkpointing during Deep Learning Model Training

Supervisors: Pınar Tözün, Ehsan Yousefzadeh-Asl-Miandoab
Semester: Fall 2024
Tags: machine learning systems, checkpointing, scheduling, resource management

Deep learning has changed the landscape of many application domains, such as computer vision and natural language processing. At the same time, deep learning requires the massive computing power offered by modern hardware. As a result, data scientists rely on powerful hardware resources offered by shared high-performance computing (HPC) clusters or the cloud. Due to the long running times of deep learning training, users of such shared cluster computing platforms experience low quality of service (QoS): they are queued and must wait to get their execution time. Since schedulers cannot reliably predict how long a training task will take to finish, they wait until the workload finishes or its requested time slot expires, which amplifies these QoS issues.

To combat this issue, deep learning frameworks such as TensorFlow and PyTorch provide checkpointing mechanisms that save the state of the trained model at a user-specified interval. Checkpointing does not come at zero cost, so doing it too frequently degrades training performance; on the other hand, doing it too infrequently may be ineffective against the QoS issues. If checkpointing could be done automatically, it could enable better scheduling decisions and make it possible to schedule tasks more fairly.
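As a minimal sketch of what interval-based checkpointing looks like, the loop below saves its state every few steps. The loop, state dictionary, and file layout are illustrative stand-ins, not any framework's actual API; plain pickle substitutes for the framework serializer (e.g., torch.save(model.state_dict(), path) in PyTorch or tf.train.Checkpoint in TensorFlow) so the sketch stays self-contained.

```python
import os
import pickle
import tempfile

def train_with_checkpoints(num_steps, checkpoint_every, ckpt_path):
    """Illustrative training loop that checkpoints every `checkpoint_every` steps."""
    state = {"step": 0, "weights": [0.0]}   # stand-in for real model state
    for step in range(1, num_steps + 1):
        state["weights"][0] += 0.1          # stand-in for one training step
        state["step"] = step
        if step % checkpoint_every == 0:
            with open(ckpt_path, "wb") as f:
                pickle.dump(state, f)       # the checkpointing cost is paid here
    return state

ckpt = os.path.join(tempfile.mkdtemp(), "model.ckpt")
final = train_with_checkpoints(num_steps=10, checkpoint_every=3, ckpt_path=ckpt)

# Resuming after an interruption: reload the most recent checkpoint
# instead of restarting from step 0.
with open(ckpt, "rb") as f:
    restored = pickle.load(f)
print(restored["step"])  # 9 — the last step at which a checkpoint was written
```

The gap between `final["step"]` (10) and `restored["step"]` (9) is exactly the work that would be lost on a failure, which is the quantity the checkpoint interval trades off against checkpointing cost.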

The goal of this project is to first quantify the impact of checkpointing in TensorFlow and PyTorch using representative small, medium, and large deep learning training workloads. Then, based on the results, we would like to determine an appropriate checkpointing frequency. Finally, we would like to build a module that enables automatic checkpointing in deep learning frameworks.
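A classic starting point for reasoning about the second step is Young's approximation, which balances the cost of writing a checkpoint against the expected amount of lost work: with checkpoint cost C and mean time between interruptions M, a near-optimal interval is sqrt(2·C·M). The numbers below are purely illustrative, not measurements from this project.

```python
import math

def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's approximation for a near-optimal checkpoint interval.

    checkpoint_cost_s: time to write one checkpoint (seconds)
    mtbf_s: mean time between failures/interruptions (seconds)
    """
    return math.sqrt(2 * checkpoint_cost_s * mtbf_s)

# Illustrative numbers: a checkpoint that takes 30 s on a node that is
# interrupted once per 24 h on average.
interval = young_interval(checkpoint_cost_s=30, mtbf_s=24 * 3600)
print(round(interval))  # 2277 — i.e., checkpoint roughly every 38 minutes
```

The formula makes the trade-off from the first step concrete: measuring C empirically for small, medium, and large workloads is exactly what turns this back-of-the-envelope estimate into a usable policy.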

This project would be suitable as a BSc or MSc thesis, as well as a standalone project at ITU, during Fall 2024. If you are interested in machine learning systems and their efficiency in general, this project would be a great fit for you. Depending on the size of the project or thesis (BSc, MSc, etc.) and the number of students in the group, we can adjust the scope (i.e., doing only the first step vs. all three steps).