Overview
Syllabus
Intro
Deep Learning Training in Shared Clusters
Example Shared-Cluster DL Training Workflow
Pollux: Co-adaptive Cluster Scheduler for DL
Outline
Background: Distributed DL (Data Parallelism)
System Throughput and Impact of Batch Size
Statistical Efficiency and Impact of Batch Size
Illustration of Overall Training Performance
Implications for Cluster Scheduling
Pollux Cluster Scheduler
Key Idea: Goodput, not Throughput (definitions sketched in the notes after this syllabus)
Modeling System Throughput (see the model sketch after this syllabus)
Modeling Statistical Efficiency
Optimizing Cluster-Wide Allocations (see the search sketch after this syllabus)
Evaluation of Pollux
Cluster-Wide Statistical Efficiency
More Experiments in our Paper!
Conclusion
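Notes

The syllabus items "Key Idea: Goodput, not Throughput" and "Modeling Statistical Efficiency" refer to the central quantities of the Pollux paper (Qiao et al., OSDI '21). As best recalled here (consult the paper for the exact definitions), goodput combines system throughput with statistical efficiency, where a is a job's GPU allocation, m its per-GPU batch size, s its number of gradient-accumulation steps, M_0 its initial batch size, and \varphi_t the gradient noise scale at training time t:

```latex
% Goodput of one job under allocation a, per-GPU batch size m,
% and s gradient-accumulation steps (M = total batch size):
\mathrm{GOODPUT}_t(a, m, s)
  = \mathrm{THROUGHPUT}(a, m, s) \times \mathrm{EFFICIENCY}_t(M),
  \qquad M = \mathrm{SUM}(a) \cdot m \cdot (s + 1)

% Statistical efficiency of training at total batch size M,
% relative to the job's initial batch size M_0:
\mathrm{EFFICIENCY}_t(M) = \frac{\varphi_t + M_0}{\varphi_t + M}
```

Intuitively, as \varphi_t grows over the course of training, larger batch sizes cost less efficiency, which is why Pollux keeps re-tuning each job's resources and batch size as training progresses.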
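"Modeling System Throughput" fits a per-job model of iteration time from metrics observed while the job runs. Below is a minimal Python sketch of one plausible form of such a model, assuming gradient-computation time is linear in the per-GPU batch size and that compute partially overlaps with gradient synchronization; every function name, parameter, and constant is illustrative, not Pollux's actual code:

```python
def t_grad(m, alpha_grad, beta_grad):
    """Time to compute gradients for a per-GPU batch of size m
    (linear model; alpha/beta would be fitted to observed timings)."""
    return alpha_grad + beta_grad * m

def t_sync(num_gpus, alpha_sync, beta_sync):
    """Time to synchronize gradients across num_gpus replicas
    (zero when a single GPU has nothing to synchronize with)."""
    return 0.0 if num_gpus == 1 else alpha_sync + beta_sync * num_gpus

def t_iter(num_gpus, m, s, params, gamma=1.5):
    """Modeled time per training iteration: s accumulation-only steps,
    plus one step whose compute and sync partially overlap. The
    gamma-norm interpolates between no overlap (gamma = 1) and full
    overlap (gamma -> infinity)."""
    compute = t_grad(m, params["alpha_grad"], params["beta_grad"])
    sync = t_sync(num_gpus, params["alpha_sync"], params["beta_sync"])
    return s * compute + (compute**gamma + sync**gamma) ** (1.0 / gamma)

def throughput(num_gpus, m, s, params):
    """Training examples processed per second under this model."""
    total_batch = num_gpus * m * (s + 1)
    return total_batch / t_iter(num_gpus, m, s, params)

# Toy parameters, not fitted to any real job:
params = {"alpha_grad": 0.1, "beta_grad": 0.002,
          "alpha_sync": 0.05, "beta_sync": 0.01}
print(throughput(num_gpus=4, m=32, s=0, params=params))
```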
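"Optimizing Cluster-Wide Allocations" then trades GPUs off across jobs: the paper scores an allocation by a generalized mean of per-job speedups and searches the space with a genetic algorithm. The brute-force search and the speedup numbers below are a toy stand-in for illustration only, assuming each job's speedup at every GPU count has already been computed from its goodput model:

```python
from itertools import product

def fitness(speedups, p=-1):
    """Generalized p-mean of per-job speedups; a negative p penalizes
    allocations that starve any single job (p = -1 is the harmonic mean)."""
    n = len(speedups)
    return (sum(s ** p for s in speedups) / n) ** (1 / p)

def best_allocation(speedup_tables, total_gpus):
    """Exhaustive search over GPU counts per job, giving every job at
    least one GPU. Tractable only at toy sizes; Pollux itself uses a
    genetic algorithm for this step."""
    jobs = list(speedup_tables)
    best, best_fit = None, float("-inf")
    for alloc in product(range(1, total_gpus + 1), repeat=len(jobs)):
        if sum(alloc) > total_gpus:
            continue
        f = fitness([speedup_tables[j][g] for j, g in zip(jobs, alloc)])
        if f > best_fit:
            best, best_fit = dict(zip(jobs, alloc)), f
    return best, best_fit

# Hypothetical speedup of each job at 1..4 GPUs, normalized to its
# fair share; real values would come from the goodput model above.
tables = {"jobA": {1: 0.5, 2: 1.0, 3: 1.4, 4: 1.6},
          "jobB": {1: 1.0, 2: 1.8, 3: 2.0, 4: 2.1}}
print(best_allocation(tables, total_gpus=4))
```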
Taught by
USENIX