Ultimate Guide to Scaling ML Models - Megatron-LM - ZeRO - DeepSpeed - Mixed Precision
Aleksa Gordić - The AI Epiphany via YouTube
Syllabus
Intro to training large ML models (trillions of params!)
Sponsored: AssemblyAI's speech transcription API
Data parallelism
Pipeline/model parallelism
Megatron-LM paper (tensor/model parallelism)
Splitting the MLP block vertically
Splitting the attention block vertically
Activation checkpointing
Combining data + model parallelism
Scaling is all you need and 3D parallelism
Mixed precision training paper
Single vs half vs bfloat16 number formats
Storing master weights in single precision
Loss scaling
Arithmetic precision matters
ZeRO optimizer paper (DeepSpeed library)
Partitioning is all you need?
Where did all the memory go?
Outro