Overview
Syllabus
Introduction.
Why is large model training needed?
Scaling improves training and model efficiency.
Larger models = more efficient, less training, less data.
Larger models can learn with few-shot learning.
Democratizing large-scale language models with OPT-175B.
Challenges of large model training.
What is PyTorch Distributed?
Features Overview.
DistributedDataParallel.
FullyShardedDataParallel.
FSDP Auto wrapping.
FSDP Auto wrapping example.
FSDP CPU Offload, Backward Prefetch policies.
FSDP Mixed Precision control.
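The FSDP items above (auto wrapping, CPU offload, backward prefetch, mixed precision) all map onto constructor arguments of FullyShardedDataParallel. A minimal sketch, assuming PyTorch 1.12+ launched with torchrun (one process per GPU); the stand-in model and the 1M-parameter wrap threshold are placeholder assumptions, not values from the course:

    import functools
    import os
    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.distributed.fsdp import (
        FullyShardedDataParallel as FSDP,
        CPUOffload,
        BackwardPrefetch,
        MixedPrecision,
    )
    from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

    # Same process-group setup that DistributedDataParallel uses.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Stand-in model; any nn.Module works here.
    model = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(8)]).cuda()

    # Auto wrapping: shard any submodule above ~1M parameters into its own FSDP unit
    # (threshold is an assumption; tune it per model).
    wrap_policy = functools.partial(size_based_auto_wrap_policy, min_num_params=1_000_000)

    model = FSDP(
        model,
        auto_wrap_policy=wrap_policy,
        cpu_offload=CPUOffload(offload_params=True),      # offload sharded parameters to CPU
        backward_prefetch=BackwardPrefetch.BACKWARD_PRE,  # prefetch the next shard before backward compute
        mixed_precision=MixedPrecision(
            param_dtype=torch.float16,
            reduce_dtype=torch.float16,
            buffer_dtype=torch.float16,
        ),
    )

For plain data parallelism, the same process-group setup is followed by torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank]) instead of the FSDP wrapper.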
Pipeline.
Example Auto Partitioning.
Pipeline + DDP (PDP).
Memory Saving Features.
Activation Checkpointing.
Activation Offloading.
Activation Checkpointing & Offloading.
Parameter Offloading.
Memory Saving Feature & Training Paradigms.
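Of the memory-saving features listed above, activation checkpointing has the smallest code footprint: wrapped blocks drop their activations in the forward pass and recompute them during backward. A minimal sketch of the idea using the stable torch.utils.checkpoint API (the 12-layer stand-in model and tensor sizes are assumptions; the wrappers used in the lectures may differ):

    import torch
    import torch.nn as nn
    from torch.utils.checkpoint import checkpoint

    class CheckpointedBlock(nn.Module):
        """Recompute this block's activations in backward instead of storing them."""
        def __init__(self, block: nn.Module):
            super().__init__()
            self.block = block

        def forward(self, x):
            # use_reentrant=False selects the non-reentrant implementation (PyTorch 1.11+);
            # drop the kwarg on older releases.
            return checkpoint(self.block, x, use_reentrant=False)

    # Checkpoint every layer of a small stand-in transformer.
    layers = nn.Sequential(
        *[CheckpointedBlock(nn.TransformerEncoderLayer(d_model=512, nhead=8)) for _ in range(12)]
    )
    x = torch.randn(32, 16, 512, requires_grad=True)  # (seq, batch, d_model)
    layers(x).sum().backward()  # activations are rebuilt block by block here, trading compute for memory

Offloading variants trade memory for host-device transfers instead of recompute: activations or parameters are parked in CPU RAM between uses, as with the CPUOffload option in the FSDP sketch above.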
Experiments & Insights.
Model Implementation.
Scaling Efficiency: Varying # GPUs.
Scaling Efficiency: Varying World Size.
Scaling Efficiency: Varying Batch Size.
Model Scale Limit.
Impact of Network Bandwidth.
Best Practices.
Best Practices FSDP.
Profiling & Troubleshooting.
Profiling & Troubleshooting for Large Scale Model Training.
Uber Prof (Experimental) Profiling & Troubleshooting tool.
Demonstration.
Combining DCGM + Profiling.
Profiling for Large Scale Model Training.
NVIDIA Nsight Systems multi-node, multi-GPU Profiling.
PyTorch Profiler: Distributed Training Profiling (single-node, multi-GPU).
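A minimal, self-contained sketch of the kind of single-node profiling the PyTorch Profiler item refers to; the Linear stand-in model, step counts, and ./log/profiler trace directory are assumptions, and in a real job each rank wraps its own training loop this way:

    import torch
    import torch.nn as nn
    from torch.profiler import profile, schedule, tensorboard_trace_handler, ProfilerActivity

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = nn.Linear(1024, 1024).to(device)          # stand-in for the real model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    activities = [ProfilerActivity.CPU]
    if torch.cuda.is_available():
        activities.append(ProfilerActivity.CUDA)      # capture GPU kernels when a GPU is present

    with profile(
        activities=activities,
        schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
        on_trace_ready=tensorboard_trace_handler("./log/profiler"),
        record_shapes=True,
        profile_memory=True,
        with_stack=True,
    ) as prof:
        for step in range(6):                          # a few steps cover one wait/warmup/active window
            x = torch.randn(64, 1024, device=device)
            loss = model(x).sum()
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            prof.step()                                # advances the profiler schedule

The resulting traces can be inspected in TensorBoard via the torch-tb-profiler plugin.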
Try it now.
Resources.
Closing Notes.
Taught by
Microsoft Developer