Overview
Syllabus
Intro
Deep Learning Training in Shared Clusters
Example Shared-Cluster DL Training Workflow
Pollux: Co-adaptive Cluster Scheduler for DL
Outline
Background: Distributed DL (Data Parallelism)
System Throughput and Impact of Batch Size
Statistical Efficiency and Impact of Batch Size
Illustration of Overall Training Performance
Implications for Cluster Scheduling
Pollux Cluster Scheduler
Key Idea: Goodput, not Throughput (definitions sketched in the notes after this syllabus)
Modeling System Throughput (see the model sketch after this syllabus)
Modeling Statistical Efficiency
Optimizing Cluster-Wide Allocations (see the search sketch after this syllabus)
Evaluation of Pollux
Cluster-Wide Statistical Efficiency
More Experiments in our Paper!
Conclusion
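Notes

The syllabus items "Key Idea: Goodput, not Throughput" and "Modeling Statistical Efficiency" refer to the central quantities of the Pollux paper (Qiao et al., OSDI '21). As best recalled here (consult the paper for the exact definitions), goodput combines system throughput with statistical efficiency, where a is a job's GPU allocation, m its per-GPU batch size, s its number of gradient-accumulation steps, M_0 its initial batch size, and \varphi_t the gradient noise scale at training time t:

```latex
% Goodput of one job under allocation a, per-GPU batch size m,
% and s gradient-accumulation steps (M = total batch size):
\mathrm{GOODPUT}_t(a, m, s)
  = \mathrm{THROUGHPUT}(a, m, s) \times \mathrm{EFFICIENCY}_t(M),
  \qquad M = \mathrm{SUM}(a) \cdot m \cdot (s + 1)

% Statistical efficiency of training at total batch size M,
% relative to the job's initial batch size M_0:
\mathrm{EFFICIENCY}_t(M) = \frac{\varphi_t + M_0}{\varphi_t + M}
```

Intuitively, as \varphi_t grows over the course of training, larger batch sizes cost less efficiency, which is why Pollux keeps re-tuning each job's resources and batch size as training progresses.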
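"Modeling System Throughput" fits a per-job model of iteration time from metrics observed while the job runs. Below is a minimal Python sketch of one plausible form of such a model, assuming gradient-computation time is linear in the per-GPU batch size and that compute partially overlaps with gradient synchronization; every function name, parameter, and constant is illustrative, not Pollux's actual code:

```python
def t_grad(m, alpha_grad, beta_grad):
    """Time to compute gradients for a per-GPU batch of size m
    (linear model; alpha/beta would be fitted to observed timings)."""
    return alpha_grad + beta_grad * m

def t_sync(num_gpus, alpha_sync, beta_sync):
    """Time to synchronize gradients across num_gpus replicas
    (zero when a single GPU has nothing to synchronize with)."""
    return 0.0 if num_gpus == 1 else alpha_sync + beta_sync * num_gpus

def t_iter(num_gpus, m, s, params, gamma=1.5):
    """Modeled time per training iteration: s accumulation-only steps,
    plus one step whose compute and sync partially overlap. The
    gamma-norm interpolates between no overlap (gamma = 1) and full
    overlap (gamma -> infinity)."""
    compute = t_grad(m, params["alpha_grad"], params["beta_grad"])
    sync = t_sync(num_gpus, params["alpha_sync"], params["beta_sync"])
    return s * compute + (compute**gamma + sync**gamma) ** (1.0 / gamma)

def throughput(num_gpus, m, s, params):
    """Training examples processed per second under this model."""
    total_batch = num_gpus * m * (s + 1)
    return total_batch / t_iter(num_gpus, m, s, params)

# Toy parameters, not fitted to any real job:
params = {"alpha_grad": 0.1, "beta_grad": 0.002,
          "alpha_sync": 0.05, "beta_sync": 0.01}
print(throughput(num_gpus=4, m=32, s=0, params=params))
```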
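"Optimizing Cluster-Wide Allocations" then trades GPUs off across jobs: the paper scores an allocation by a generalized mean of per-job speedups and searches the space with a genetic algorithm. The brute-force search and the speedup numbers below are a toy stand-in for illustration only, assuming each job's speedup at every GPU count has already been computed from its goodput model:

```python
from itertools import product

def fitness(speedups, p=-1):
    """Generalized p-mean of per-job speedups; a negative p penalizes
    allocations that starve any single job (p = -1 is the harmonic mean)."""
    n = len(speedups)
    return (sum(s ** p for s in speedups) / n) ** (1 / p)

def best_allocation(speedup_tables, total_gpus):
    """Exhaustive search over GPU counts per job, giving every job at
    least one GPU. Tractable only at toy sizes; Pollux itself uses a
    genetic algorithm for this step."""
    jobs = list(speedup_tables)
    best, best_fit = None, float("-inf")
    for alloc in product(range(1, total_gpus + 1), repeat=len(jobs)):
        if sum(alloc) > total_gpus:
            continue
        f = fitness([speedup_tables[j][g] for j, g in zip(jobs, alloc)])
        if f > best_fit:
            best, best_fit = dict(zip(jobs, alloc)), f
    return best, best_fit

# Hypothetical speedup of each job at 1..4 GPUs, normalized to its
# fair share; real values would come from the goodput model above.
tables = {"jobA": {1: 0.5, 2: 1.0, 3: 1.4, 4: 1.6},
          "jobB": {1: 1.0, 2: 1.8, 3: 2.0, 4: 2.1}}
print(best_allocation(tables, total_gpus=4))
```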
Taught by
USENIX