Scaling ML Workloads with PyTorch
- 1 Introduction
- 2 Why is large model training needed?
- 3 Scaling creates training and model efficiency
- 4 Larger models = more efficient, less training, less data
- 5 Larger models can learn with few-shot learning
- 6 Democratizing large-scale language models with OPT-175B
- 7 Challenges of large model training
- 8 What is PyTorch Distributed?
- 9 Features Overview
- 10 DistributedDataParallel
- 11 FullyShardedDataParallel
- 12 FSDP Auto wrapping
- 13 FSDP Auto wrapping example
- 14 FSDP CPU Offload, Backward Prefetch policies
- 15 FSDP Mixed Precision control
- 16 Pipeline
- 17 Example Auto Partitioning
- 18 Pipeline + DDP (PDP)
- 19 Memory Saving Features
- 20 Activation Checkpointing
- 21 Activation Offloading
- 22 Activation Checkpointing & Offloading
- 23 Parameter Offloading
- 24 Memory Saving Feature & Training Paradigms
- 25 Experiments & Insights
- 26 Model Implementation
- 27 Scaling Efficiency Varying # GPUs
- 28 Scaling Efficiency Varying World Size
- 29 Scaling Efficiency Varying Batch Size
- 30 Model Scale Limit
- 31 Impact of Network Bandwidth
- 32 Best Practices
- 33 Best Practices FSDP
- 34 Profiling & Troubleshooting
- 35 Profiling & Troubleshooting for Large Scale Model Training
- 36 Uber Prof (Experimental) Profiling & Troubleshooting tool
- 37 Demonstration
- 38 Combining DCGM + Profiling
- 39 Profiling for Large Scale Model Training
- 40 NVIDIA Nsight multi-node, multi-GPU Profiling
- 41 PyTorch Profiler Distributed Training Profiling (single-node, multi-GPU)
- 42 Try it now
- 43 Resources
- 44 Closing Notes