Microsoft

Scaling ML Workloads with PyTorch

Microsoft via YouTube

Overview

Explore best practices and techniques for scaling machine learning workloads to build large-scale models with PyTorch in this 38-minute conference talk from Microsoft Build 2022. Learn from experience training 175-billion- and 1-trillion-parameter models, covering different training paradigms along with techniques for profiling and troubleshooting. Dive into topics such as PyTorch Distributed, DistributedDataParallel, FullyShardedDataParallel (FSDP), pipeline parallelism, memory-saving features, and scaling efficiency. Gain insights into model implementation, scaling limits, the impact of network bandwidth, and best practices for large-scale training. Discover profiling and troubleshooting tools for distributed training, including Uber Prof, DCGM, NVIDIA Nsight, and the PyTorch Profiler. By the end, you will have the knowledge needed to jumpstart your own efforts in scaling ML workloads with PyTorch.
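The contrast the talk draws between DistributedDataParallel and FullyShardedDataParallel can be previewed with a minimal sketch (assumes PyTorch >= 1.12 with distributed support; the toy model and its dimensions are illustrative, and the FSDP wrapper additionally requires an initialized process group, which is omitted here):

```python
import torch
import torch.nn as nn


def build_model() -> nn.Module:
    # Toy stand-in for a large transformer; real workloads are far bigger.
    return nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))


def wrap_with_fsdp(model: nn.Module) -> nn.Module:
    # FSDP shards parameters, gradients, and optimizer state across ranks,
    # unlike DDP, which replicates the full model on every GPU.
    # Requires torch.distributed.init_process_group() to have been called.
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
    return FSDP(model)
```

With the auto-wrapping policies covered in the syllabus below, FSDP can decide per-submodule which layers to shard rather than wrapping the whole model as a single unit.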

Syllabus

Introduction
Why is large model training needed?
Scaling creates training and model efficiency
Larger models = more efficient, less training, less data
Larger models can learn with few-shot learning
Democratizing large-scale language models with OPT-175B
Challenges of large model training
What is PyTorch Distributed?
Features Overview
DistributedDataParallel
FullyShardedDataParallel
FSDP Auto Wrapping
FSDP Auto Wrapping example
FSDP CPU Offload, Backward Prefetch policies
FSDP Mixed Precision control
Pipeline
Example: Auto Partitioning
Pipeline + DDP (PDP)
Memory-Saving Features
Activation Checkpointing
Activation Offloading
Activation Checkpointing & Offloading
Parameter Offloading
Memory-Saving Features & Training Paradigms
Experiments & Insights
Model Implementation
Scaling Efficiency: Varying # GPUs
Scaling Efficiency: Varying World Size
Scaling Efficiency: Varying Batch Size
Model Scale Limit
Impact of Network Bandwidth
Best Practices
Best Practices: FSDP
Profiling & Troubleshooting
Profiling & Troubleshooting for Large-Scale Model Training
Uber Prof (experimental) profiling & troubleshooting tool
Demonstration
Combining DCGM + Profiling
Profiling for Large-Scale Model Training
NVIDIA Nsight multi-node, multi-GPU profiling
PyTorch Profiler distributed training profiling (single-node, multi-GPU)
Try it now
Resources
Closing Notes
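The activation-checkpointing items in the syllabus can be sketched with a short example (assumes a recent PyTorch; the block shape is illustrative):

```python
import torch
from torch.utils.checkpoint import checkpoint


class CheckpointedBlock(torch.nn.Module):
    """Feed-forward block whose activations are recomputed during backward."""

    def __init__(self) -> None:
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(256, 1024),
            torch.nn.GELU(),
            torch.nn.Linear(1024, 256),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # checkpoint() discards intermediate activations after the forward
        # pass and recomputes them in backward, trading compute for memory.
        return checkpoint(self.ff, x, use_reentrant=False)
```

Activation offloading, also covered in the talk, takes the complementary approach of moving saved activations to CPU memory instead of recomputing them.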

Taught by

Microsoft Developer

