machine learning systems -- DASYA

PROPOSAL

Evaluating the Impact of Collocating Deep Learning Training Tasks on Jetson Orin Nano GPUs

This project investigates how running multiple deep learning training tasks simultaneously (collocation) affects performance on resource-constrained edge devices, specifically the NVIDIA Jetson Orin Nano. Students will deploy and benchmark various models (e.g., CNNs, Transformers) in isolation and in several collocated scenarios, measure metrics such as GPU utilization, memory usage, training time, …
Supervisors: Pınar Tözün, Ehsan Yousefzadeh-Asl-Miandoab
Semester: Fall 2025
Tags: machine learning systems, GPU Utilization, resource management, resource interference

PROPOSAL

BLOX for Deep Learning Task Scheduling with GPU Collocation

Workload collocation has been shown to be an effective method for reducing the hardware requirements of certain deep learning (DL) training tasks. On the other hand, there haven’t been many robust open-source implementations of schedulers that incorporate workload collocation on GPUs for DL. BLOX is a framework that aims to standardize the way we implement deep learning schedulers. In this …
Supervisors: Pınar Tözün, Ehsan Yousefzadeh-Asl-Miandoab
Semester: Fall 2025
Tags: machine learning systems, scheduling, resource management, workload collocation

PROPOSAL

Checkpointing during Deep Learning Training

Deep learning training can run for hours or days, causing long queue times and poor quality of service (QoS) on shared clusters. Since schedulers can’t accurately predict training durations, they often wait for jobs to finish or time out, worsening the issue. Frameworks like TensorFlow and PyTorch support model checkpointing, but frequent checkpoints can slow training, while infrequent ones reduce …
Supervisors: Pınar Tözün, Ehsan Yousefzadeh-Asl-Miandoab
Semester: Fall 2025
Tags: machine learning systems, checkpointing, scheduling, resource management

PROPOSAL

GPU Memory Dataset with a focus on Transformer-Based Models

This project focuses on extending an existing dataset for predicting GPU memory requirements during deep learning training by incorporating transformer-based models such as BERT, GPT, and their variants. The student will study the architecture of these models and develop training scripts to run them under controlled conditions. During training, key GPU metrics—including memory usage, utilization, …
Supervisors: Pınar Tözün, Ehsan Yousefzadeh-Asl-Miandoab
Semester: Fall 2025
Tags: machine learning systems, GPU Memory Requirement, GPU Utilization, resource management

PROPOSAL

Predicting GPU Utilization for Deep Learning Training

GPUs offer massive computational power and parallelism through their Streaming Multiprocessors (SMs). Efficient GPU utilization is critical for maximizing performance and optimizing compute resource usage; it is measured using metrics such as SMACT (SM Activity), SMOCC (SM Occupancy), and DRAMA (DRAM Active). These metrics provide insight into how effectively the GPU’s SMs and …
Supervisors: Pınar Tözün, Ehsan Yousefzadeh-Asl-Miandoab
Semester: Fall 2025
Tags: machine learning systems, GPU Utilization, resource management, resource interference
