MegaScale - Scaling Large Language Model Training to More Than 10,000 GPUs

USENIX via YouTube

Overview

Explore the groundbreaking research on scaling large language model (LLM) training to over 10,000 GPUs in this conference talk from NSDI '24. Dive into the design, implementation, and engineering challenges of MegaScale, a production system developed by ByteDance and Peking University researchers. Learn about the full-stack approach that co-designs algorithmic and system components to address unprecedented challenges in training efficiency and stability. Discover innovative techniques for model block and optimizer design, computation and communication overlapping, operator optimization, data pipeline improvements, and network performance tuning. Gain insights into maintaining high efficiency throughout long-duration LLM training jobs and the importance of in-depth observability for tackling hard stability issues that emerge at large scale. Examine the diagnostic tools developed to monitor system components, identify root causes, and implement effective fault tolerance and straggler mitigation. Understand how MegaScale achieves 55.2% Model FLOPs Utilization (MFU) when training a 175B-parameter LLM on 12,288 GPUs, a 1.34x improvement in MFU over Megatron-LM. Benefit from the operational experience shared in identifying and fixing failures and stragglers, and gain valuable insights for future LLM systems research.
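
For context on the headline result, MFU measures how much of the hardware's theoretical compute a training job actually turns into useful model FLOPs. The short Python sketch below shows the conventional calculation (the definition popularized by the PaLM and Megatron-LM work); the sustained-throughput and per-GPU peak figures in the example are illustrative assumptions, not numbers taken from the talk.

# Minimal sketch of Model FLOPs Utilization (MFU): the ratio of the model
# FLOPs a job sustains to the hardware's aggregate peak FLOPs.
def mfu(tokens_per_sec, n_params, n_gpus, peak_flops_per_gpu):
    model_flops_per_token = 6 * n_params  # ~6N FLOPs per token (forward + backward), attention ignored
    achieved_flops = tokens_per_sec * model_flops_per_token
    return achieved_flops / (n_gpus * peak_flops_per_gpu)

# Hypothetical example: a 175B-parameter model on 12,288 GPUs, assuming
# ~2.0M tokens/s sustained and 312 TFLOP/s peak per GPU (BF16 on an A100-class part).
print(f"MFU = {mfu(2.0e6, 175e9, 12_288, 312e12):.1%}")  # ~54.8%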

Syllabus

NSDI '24 - MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs

Taught by

USENIX
