Scaling ML Workloads with PyTorch

Microsoft Developer via YouTube

Combining DCGM + Profiling

Classroom Contents

  1. Introduction
  2. Why is large model training needed?
  3. Scaling creates training and model efficiency
  4. Larger models = more efficient, less training, less data
  5. Larger models can learn with few-shot learning
  6. Democratizing large-scale language models with OPT-175B
  7. Challenges of large model training
  8. What is PyTorch Distributed?
  9. Features Overview
  10. DistributedDataParallel
  11. FullyShardedDataParallel
  12. FSDP Auto wrapping
  13. FSDP Auto wrapping example
  14. FSDP CPU Offload, Backward Prefetch policies
  15. FSDP Mixed Precision control
  16. Pipeline
  17. Example Auto Partitioning
  18. Pipeline + DDP (PDP)
  19. Memory Saving Features
  20. Activation Checkpointing
  21. Activation Offloading
  22. Activation Checkpointing & Offloading
  23. Parameter Offloading
  24. Memory Saving Feature & Training Paradigms
  25. Experiments & Insights
  26. Model Implementation
  27. Scaling Efficiency Varying # GPUs
  28. Scaling Efficiency Varying World Size
  29. Scaling Efficiency Varying Batch Size
  30. Model Scale Limit
  31. Impact of Network Bandwidth
  32. Best Practices
  33. Best Practices FSDP
  34. Profiling & Troubleshooting
  35. Profiling & Troubleshooting for Large Scale Model Training
  36. Uber Prof (Experimental) Profiling & Troubleshooting tool
  37. Demonstration
  38. Combining DCGM + Profiling
  39. Profiling for Large Scale Model Training
  40. NVIDIA Nsight multi-node, multi-GPU Profiling
  41. PyTorch Profiler Distributed Training Profiling (single-node multi-GPU)
  42. Try it now
  43. Resources
  44. Closing Notes
