Ultimate Guide to Scaling ML Models - Megatron-LM - ZeRO - DeepSpeed - Mixed Precision
Aleksa Gordić - The AI Epiphany via YouTube
Syllabus
Intro to training large ML models (trillions of params!)
Sponsored: AssemblyAI's speech transcription API
Data parallelism
Pipeline/model parallelism
Megatron-LM paper (tensor/model parallelism)
Splitting the MLP block vertically
Splitting the attention block vertically
Activation checkpointing
Combining data + model parallelism
Scaling is all you need and 3D parallelism
Mixed precision training paper
Single vs half vs bfloat16 number formats
Storing master weights in single precision
Loss scaling
Arithmetic precision matters
ZeRO optimizer paper (DeepSpeed library)
Partitioning is all you need?
Where did all the memory go?
Outro