Overview
Explore a comprehensive analysis of Machine Learning as a Service (MLaaS) workloads in large-scale heterogeneous GPU clusters through this 15-minute conference talk from NSDI '22. Dive into the challenges of running diverse ML workloads, including low GPU utilization, long queueing delays, and scheduling complexities. Examine a two-month workload trace from Alibaba's production MLaaS cluster with over 6,000 GPUs, and learn about current solutions and open challenges in cluster scheduling. Gain insights into resource requests, machine utilization, GPU sharing, task duration prediction, and potential CPU bottlenecks. Understand the implications of imbalanced scheduling across heterogeneous machines and discover key takeaways for optimizing large-scale ML infrastructure.
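The talk's point about duration prediction for recurring tasks can be illustrated with a minimal sketch: since the same tasks recur in production, a scheduler can estimate a task's next run time from its own history. This is only an illustration, not the method from the talk; the function name, the median-based estimator, and the fallback value are all assumptions for the example.

```python
from statistics import median

def predict_duration(history, default=3600.0):
    """Estimate a recurring task's next run time (seconds) from past runs.

    Illustrative sketch: uses the median of historical durations, which is
    robust to occasional outlier runs. Falls back to a default guess when
    the task has no history (e.g., a first-time submission).
    """
    if not history:
        return default  # no history yet: fall back to a default estimate
    return median(history)

# Example: a recurring training task with four past run times (seconds).
past_runs = [1200.0, 1350.0, 1280.0, 1310.0]
print(predict_duration(past_runs))  # median of the recorded runs
```

Estimates like this let a scheduler anticipate queueing delays and pack GPUs more tightly, which is one of the scheduling opportunities the trace analysis highlights.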
Syllabus
Intro
Production ML Workloads
Trace Overview
Run-time and Queueing Delays
Resource Requests & Usage
Machine Resource Utilization
GPU Sharing
Duration Prediction for Recurring Tasks
CPU Can Be the Bottleneck
Imbalanced Scheduling
Takeaways
Taught by
USENIX