Deep learning has changed the landscape of many applications, such as computer vision and natural language processing. On the other hand, deep learning requires the gigantic computing power offered by modern hardware. As a result, data scientists rely on powerful hardware resources offered by shared high-performance computing (HPC) clusters or the cloud. Due to the long running times of deep learning …
Supervisors:
Pınar Tözün, Ehsan Yousefzadeh-Asl-Miandoab
Semester: Fall 2024
Tags: machine learning systems, checkpointing, scheduling, resource management
Workload collocation has been shown to be an effective method for reducing the hardware requirements of certain deep learning (DL) training tasks. However, there have not been many robust open-source implementations of schedulers that incorporate workload collocation on GPUs for DL.
BLOX is a framework that aims to standardize the way we implement deep learning schedulers. In this …
Supervisors:
Pınar Tözün, Ehsan Yousefzadeh-Asl-Miandoab
Semester: Fall 2024
Tags: machine learning systems, scheduling, resource management, workload collocation