MLaaS in the Wild - Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters

Overview

Explore a comprehensive analysis of Machine Learning as a Service (MLaaS) workloads in large-scale heterogeneous GPU clusters through this 15-minute conference talk from NSDI '22. Dive into the challenges of running diverse ML workloads, including low GPU utilization, long queueing delays, and scheduling complexities. Examine a two-month workload trace from Alibaba's production MLaaS cluster with over 6,000 GPUs, and learn about current solutions and open challenges in cluster scheduling. Gain insights into resource requests, machine utilization, GPU sharing, task duration prediction, and potential CPU bottlenecks. Understand the implications of imbalanced scheduling across heterogeneous machines and discover key takeaways for optimizing large-scale ML infrastructure.

Syllabus

Intro
Production ML Workloads
Trace Overview
Run-time and Queueing Delays
Resource Requests & Usage
Machine Resource Utilization
GPU Sharing
Duration Prediction for Recurring Tasks
CPU can be the bottleneck
Imbalanced Scheduling
Takeaways

Taught by

USENIX
