Efficient Large-Scale Language Model Training on GPU Clusters

Overview

Explore efficient large-scale language model training on GPU clusters in this 23-minute video from Databricks. Learn about the challenges of training massive models, including limited GPU memory and lengthy computation times. Discover how to combine tensor, pipeline, and data parallelism to scale training to thousands of GPUs, enabling a hundredfold increase in the model size that can be trained. Examine a novel pipeline parallelism schedule that boosts throughput by over 10% compared to existing approaches. Gain insights into the trade-offs between the parallelism techniques and how to optimize distributed training configurations. See how the combined methods achieve 502 petaFLOP/s on a 1-trillion-parameter model using 3072 GPUs, at 52% of peak per-GPU throughput. The code is open source, and the video covers implementation details, including domain-specific optimizations that improve GPU utilization.
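To put the headline figures in per-GPU terms, the short sketch below divides the aggregate throughput by the GPU count and then backs out the per-GPU peak implied by the 52% figure. It uses only the numbers quoted in the overview; the function and constant names are illustrative and do not come from the video or its code.

```python
# Figures quoted in the overview; names below are illustrative only.
AGGREGATE_PFLOPS = 502      # petaFLOP/s across the whole cluster
NUM_GPUS = 3072             # GPUs used for the 1-trillion-parameter run
FRACTION_OF_PEAK = 0.52     # achieved share of per-GPU peak throughput


def per_gpu_teraflops(aggregate_pflops: float, num_gpus: int) -> float:
    """Average achieved throughput per GPU, in teraFLOP/s."""
    return aggregate_pflops * 1000 / num_gpus


def implied_peak_teraflops(achieved_tflops: float, fraction: float) -> float:
    """Per-GPU peak implied by an achieved throughput and its share of peak."""
    return achieved_tflops / fraction


achieved = per_gpu_teraflops(AGGREGATE_PFLOPS, NUM_GPUS)
peak = implied_peak_teraflops(achieved, FRACTION_OF_PEAK)

print(f"achieved per GPU: {achieved:.1f} teraFLOP/s")      # 163.4
print(f"implied per-GPU peak: {peak:.0f} teraFLOP/s")      # 314
```

Each GPU therefore sustains roughly 163 teraFLOP/s on average, which is the quantity the 52% utilization figure refers to.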

Syllabus

Introduction
GPU Cluster
Model Training Graph
Training
Idle Periods
Pipelining
Pipeline Bubble
Tradeoffs
Interleave Schedule
Results
Hyperparameters
Domain-Specific Optimization
GPU Throughput
Implementation
Conclusion

Taught by

Databricks

