Overview
Syllabus
Introduction.
Why is large model training needed?
Scaling improves training and model efficiency.
Larger models = more efficient, less training, less data.
Larger models can learn with few-shot learning.
Democratizing large-scale language models with OPT-175B.
Challenges of large model training.
What is PyTorch Distributed?
Features Overview.
DistributedDataParallel.
FullyShardedDataParallel.
FSDP Auto wrapping.
FSDP Auto wrapping example.
FSDP CPU Offload, Backward Prefetch policies.
FSDP Mixed Precision control.
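The FSDP items above (auto wrapping, CPU offload, backward prefetch, mixed precision) all map onto constructor arguments of FullyShardedDataParallel. A minimal sketch, assuming PyTorch 1.12+ launched with torchrun (one process per GPU); the stand-in model and the 1M-parameter wrap threshold are placeholder assumptions, not values from the course:

    import functools
    import os
    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.distributed.fsdp import (
        FullyShardedDataParallel as FSDP,
        CPUOffload,
        BackwardPrefetch,
        MixedPrecision,
    )
    from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

    # Same process-group setup that DistributedDataParallel uses.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Stand-in model; any nn.Module works here.
    model = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(8)]).cuda()

    # Auto wrapping: shard any submodule above ~1M parameters into its own FSDP unit
    # (threshold is an assumption; tune it per model).
    wrap_policy = functools.partial(size_based_auto_wrap_policy, min_num_params=1_000_000)

    model = FSDP(
        model,
        auto_wrap_policy=wrap_policy,
        cpu_offload=CPUOffload(offload_params=True),      # offload sharded parameters to CPU
        backward_prefetch=BackwardPrefetch.BACKWARD_PRE,  # prefetch the next shard before backward compute
        mixed_precision=MixedPrecision(
            param_dtype=torch.float16,
            reduce_dtype=torch.float16,
            buffer_dtype=torch.float16,
        ),
    )

For plain data parallelism, the same process-group setup is followed by torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank]) instead of the FSDP wrapper.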
Pipeline.
Example Auto Partitioning.
Pipeline + DDP (PDP).
Memory Saving Features.
Activation Checkpointing.
Activation Offloading.
Activation Checkpointing & Offloading.
Parameter Offloading.
Memory Saving Feature & Training Paradigms.
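Of the memory-saving features listed above, activation checkpointing has the smallest code footprint: wrapped blocks drop their activations in the forward pass and recompute them during backward. A minimal sketch of the idea using the stable torch.utils.checkpoint API (the 12-layer stand-in model and tensor sizes are assumptions; the wrappers used in the lectures may differ):

    import torch
    import torch.nn as nn
    from torch.utils.checkpoint import checkpoint

    class CheckpointedBlock(nn.Module):
        """Recompute this block's activations in backward instead of storing them."""
        def __init__(self, block: nn.Module):
            super().__init__()
            self.block = block

        def forward(self, x):
            # use_reentrant=False selects the non-reentrant implementation (PyTorch 1.11+);
            # drop the kwarg on older releases.
            return checkpoint(self.block, x, use_reentrant=False)

    # Checkpoint every layer of a small stand-in transformer.
    layers = nn.Sequential(
        *[CheckpointedBlock(nn.TransformerEncoderLayer(d_model=512, nhead=8)) for _ in range(12)]
    )
    x = torch.randn(32, 16, 512, requires_grad=True)  # (seq, batch, d_model)
    layers(x).sum().backward()  # activations are rebuilt block by block here, trading compute for memory

Offloading variants trade memory for host-device transfers instead of recompute: activations or parameters are parked in CPU RAM between uses, as with the CPUOffload option in the FSDP sketch above.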
Experiments & Insights.
Model Implementation.
Scaling Efficiency: Varying # GPUs.
Scaling Efficiency: Varying World Size.
Scaling Efficiency: Varying Batch Size.
Model Scale Limit.
Impact of Network Bandwidth.
Best Practices.
Best Practices FSDP.
Profiling & Troubleshooting.
Profiling & Troubleshooting for Large Scale Model Training.
Uber Prof (Experimental) Profiling & Troubleshooting tool.
Demonstration.
Combining DCGM + Profiling.
Profiling for Large Scale Model Training.
NVIDIA Nsight Systems multi-node, multi-GPU Profiling.
PyTorch Profiler: Distributed Training Profiling (single-node, multi-GPU).
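A minimal, self-contained sketch of the kind of single-node profiling the PyTorch Profiler item refers to; the Linear stand-in model, step counts, and ./log/profiler trace directory are assumptions, and in a real job each rank wraps its own training loop this way:

    import torch
    import torch.nn as nn
    from torch.profiler import profile, schedule, tensorboard_trace_handler, ProfilerActivity

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = nn.Linear(1024, 1024).to(device)          # stand-in for the real model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    activities = [ProfilerActivity.CPU]
    if torch.cuda.is_available():
        activities.append(ProfilerActivity.CUDA)      # capture GPU kernels when a GPU is present

    with profile(
        activities=activities,
        schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
        on_trace_ready=tensorboard_trace_handler("./log/profiler"),
        record_shapes=True,
        profile_memory=True,
        with_stack=True,
    ) as prof:
        for step in range(6):                          # a few steps cover one wait/warmup/active window
            x = torch.randn(64, 1024, device=device)
            loss = model(x).sum()
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            prof.step()                                # advances the profiler schedule

The resulting traces can be inspected in TensorBoard via the torch-tb-profiler plugin.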
Try it now.
Resources.
Closing Notes.
Taught by
Microsoft Developer