Overview
Learn the fundamentals of distributed training for machine learning in a comprehensive MIT lecture covering parallelization methods, data parallelism, and communication primitives. Dive deep into memory-reduction techniques such as ZeRO (Zero Redundancy Optimizer) and FSDP (Fully Sharded Data Parallel), and explore pipeline parallelism, tensor parallelism, and sequence parallelism. Professor Song Han delivers this 70-minute lecture, presenting essential background, motivation, and the main parallelization approaches for training large-scale machine learning models efficiently across distributed systems.
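To make the data-parallelism and communication-primitive topics concrete, here is a minimal illustrative sketch (not taken from the lecture) of synchronous data-parallel training with PyTorch's torch.distributed: each rank computes gradients on its own data shard, then an all-reduce averages the gradients across workers. The script name and launch command are assumptions for illustration only.

```python
"""Hypothetical sketch of data parallelism via all-reduce.
Launch with, e.g.: torchrun --nproc_per_node=2 data_parallel_sketch.py"""
import torch
import torch.distributed as dist
import torch.nn as nn


def main():
    # One process per worker; use backend="nccl" when training on GPUs.
    dist.init_process_group(backend="gloo")
    rank, world_size = dist.get_rank(), dist.get_world_size()

    model = nn.Linear(16, 1)   # identical model replica on every rank
    x = torch.randn(8, 16)     # each rank draws a different local data shard
    loss = model(x).pow(2).mean()
    loss.backward()            # local gradients from the local shard

    # Communication primitive: all-reduce sums gradients across all ranks;
    # dividing by world_size gives the globally averaged gradient.
    for p in model.parameters():
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= world_size

    if rank == 0:
        print("averaged gradient norm:", model.weight.grad.norm().item())

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Memory-reduction methods such as ZeRO and FSDP start from this same data-parallel setup but additionally shard optimizer states, gradients, and parameters across ranks instead of replicating them.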
Syllabus
EfficientML.ai Lecture 19 - Distributed Training Part 1 (MIT 6.5940, Fall 2024)
Taught by
MIT HAN Lab