Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads

Overview

Explore a 20-minute conference talk from OSDI '20 on Gavel, a heterogeneity-aware cluster scheduler for deep learning workloads. Learn how Gavel addresses two challenges in cluster management: training performance that varies across specialized accelerators, and diverse scheduling objectives. Discover the concept of effective throughput and how it is used to transform existing scheduling policies into heterogeneity-aware versions. Understand how Gavel's round-based scheduling mechanism realizes the resulting allocations on a heterogeneous cluster. Examine the improvements Gavel offers over heterogeneity-agnostic policies, including the ability to sustain higher input load and significant improvements in makespan and average job completion time.
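As a rough illustration of the effective throughput idea named above, the sketch below treats a job's allocation as the fraction of time it spends on each accelerator type and computes the time-weighted average of its per-accelerator throughputs. The function name, the accelerator labels, and all numbers are hypothetical and chosen only for illustration; this is not Gavel's actual code.

```python
def effective_throughput(throughputs, allocation):
    """Time-weighted average throughput of one job across accelerator types.

    throughputs: accelerator type -> job throughput on that type (e.g. samples/sec)
    allocation:  accelerator type -> fraction of time the job runs on that type
    Both dictionaries are hypothetical inputs for this sketch.
    """
    if set(throughputs) != set(allocation):
        raise ValueError("throughputs and allocation must cover the same accelerator types")
    if any(fraction < 0 for fraction in allocation.values()):
        raise ValueError("time fractions must be non-negative")
    # A job cannot be scheduled for more than 100% of the time.
    if sum(allocation.values()) > 1 + 1e-9:
        raise ValueError("time fractions must sum to at most 1")
    return sum(throughputs[acc] * allocation[acc] for acc in throughputs)


# Illustrative numbers only: the job runs fastest on the newest accelerator.
job_throughputs = {"V100": 40.0, "P100": 20.0, "K80": 10.0}
job_allocation = {"V100": 0.5, "P100": 0.25, "K80": 0.0}

print(effective_throughput(job_throughputs, job_allocation))  # 25.0
```

A heterogeneity-aware policy can then be written in terms of this quantity rather than a raw count of assigned devices, so that time on a fast accelerator counts for more than the same amount of time on a slow one.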

Syllabus

Intro
Hardware for ML training is becoming highly specialized and heterogeneous!
How should we allocate heterogeneous resources?
Challenge 1: Heterogeneous performance
Challenge 2: Diverse scheduling objectives
Related work
Gavel: A new heterogeneity-aware cluster scheduler
Scheduling policies to be made heterogeneity-aware
Policies as optimization problems
Allocations (x) as time fractions
Effective throughput
Performance optimizations: space sharing and placement
How do we realize an optimal allocation?
Gavel's round-based scheduling
Main questions
Gavel improves objectives on a heterogeneous cluster
Gavel can enable the same heterogeneous cluster to support higher input load
Gavel can support hierarchical policies
Gavel scales to clusters with hundreds of active jobs
Conclusion

Taught by

USENIX

Related Courses

Pollux - Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning

Cilantro - Performance-Aware Resource Allocation for General Objectives via Online Feedback
