Overview
Syllabus
Intro
The Size of Deep Learning Models Is Increasing Quickly
Billion-Scale Model Training - Scaling Out Large Models
Mixed-precision training
Limiting CPU Computation
Minimizing Communication Volume
ZeRO-Offload enables large model training by offloading data and compute to the CPU
Unique Optimal Offload Strategy
ZeRO-Offload Single-GPU Schedule
ZeRO-Offload Multi-GPU Schedule
Optimized CPU Execution
Evaluation
Model Scale
Training Throughput - Single GPU
Training Throughput - Multiple GPUs
Throughput Scalability
One-step Delayed Parameter Update (DPU)
Conclusions
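
The "Mixed-precision training" item above refers to running forward and backward passes in FP16 while the optimizer keeps FP32 master weights. The following is a minimal, illustrative sketch using PyTorch's torch.cuda.amp; the model, data, and hyperparameters are placeholders and are not taken from the talk.

    import torch
    from torch import nn
    from torch.cuda.amp import autocast, GradScaler

    model = nn.Linear(1024, 1024).cuda()              # placeholder model
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scaler = GradScaler()                             # loss scaling avoids FP16 gradient underflow

    for _ in range(10):                               # placeholder training loop
        x = torch.randn(32, 1024, device="cuda")
        y = torch.randn(32, 1024, device="cuda")
        optimizer.zero_grad()
        with autocast():                              # run the forward pass in FP16 where safe
            loss = nn.functional.mse_loss(model(x), y)
        scaler.scale(loss).backward()                 # backward pass on the scaled loss
        scaler.step(optimizer)                        # unscale gradients, update FP32 weights
        scaler.update()                               # adapt the loss scale for the next step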
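The ZeRO-Offload schedule items above describe keeping FP16 parameters and forward/backward compute on the GPU while gradients, FP32 optimizer states, and the parameter update run on the CPU. The technique ships in the DeepSpeed library; the sketch below shows one way it might be enabled, assuming a recent DeepSpeed release, with a placeholder model and config values chosen for illustration only.

    import torch
    import deepspeed
    from torch import nn

    model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

    ds_config = {
        "train_micro_batch_size_per_gpu": 8,
        "fp16": {"enabled": True},                    # mixed-precision training on the GPU
        "zero_optimization": {
            "stage": 2,                               # ZeRO-2 partitioning of optimizer states and gradients
            "offload_optimizer": {                    # ZeRO-Offload: keep optimizer states and
                "device": "cpu",                      # the parameter update on the CPU
                "pin_memory": True
            }
        },
        "optimizer": {"type": "Adam", "params": {"lr": 1e-4}}
    }

    engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=ds_config,
    )

    x = torch.randn(8, 1024, device=engine.device, dtype=torch.half)
    y = torch.randn(8, 1024, device=engine.device, dtype=torch.half)
    loss = nn.functional.mse_loss(engine(x), y)
    engine.backward(loss)   # with optimizer offload, gradients are moved to CPU memory
    engine.step()           # CPU Adam updates FP32 weights; FP16 copies return to the GPU

Such a script is normally launched with the DeepSpeed launcher (e.g. deepspeed train.py) so the distributed environment is set up before deepspeed.initialize runs.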
Taught by
USENIX