
YouTube

MLaaS in the Wild - Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters

USENIX via YouTube

Overview

Explore a comprehensive analysis of Machine Learning as a Service (MLaaS) workloads in large-scale heterogeneous GPU clusters through this 15-minute conference talk from NSDI '22. Dive into the challenges of running diverse ML workloads, including low GPU utilization, long queueing delays, and scheduling complexities. Examine a two-month workload trace from Alibaba's production MLaaS cluster with over 6,000 GPUs, and learn about current solutions and open challenges in cluster scheduling. Gain insights into resource requests, machine utilization, GPU sharing, task duration prediction, and potential CPU bottlenecks. Understand the implications of imbalanced scheduling across heterogeneous machines and discover key takeaways for optimizing large-scale ML infrastructure.
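The topics above lend themselves to straightforward trace analysis. The snippet below is a minimal, illustrative sketch (not taken from the talk or from the released Alibaba trace) of how queueing delays and GPU requests might be summarized with pandas; the file name and column names are assumptions made purely for illustration.

import pandas as pd

# Hypothetical workload-trace export; column names are assumptions, not the real schema.
trace = pd.read_csv("mlaas_trace.csv")

# Queueing delay: time between task submission and the start of execution.
trace["queue_delay_s"] = trace["start_time"] - trace["submit_time"]

# Distribution of queueing delays across all tasks.
print(trace["queue_delay_s"].describe())

# Average number of GPUs requested per machine/GPU type.
print(trace.groupby("gpu_type")["gpu_request"].mean())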

Syllabus

Intro
Production ML Workloads
Trace Overview
Run-time and Queueing delays
Resource Requests & Usage
Machine Resource Utilization
GPU Sharing
Duration Prediction for Recurring Tasks
CPU can be the bottleneck
Imbalanced Scheduling
Takeaways

Taught by

USENIX

