Alibaba HPN - Data Center Network Architecture for Large Language Model Training
Open Compute Project via YouTube
Overview
Learn about Alibaba's High Performance Network (HPN) architecture in this technical presentation on infrastructure for Large Language Model (LLM) training. Discover why traditional data center networks fall short for LLM training workloads, which generate far fewer but much larger data flows than general cloud computing. Explore HPN's 2-tier, dual-plane architecture, which connects 15,000 GPUs within a single Pod and improves on conventional 3-tier Clos designs. Examine how the dual-ToR implementation enhances reliability by removing the top-of-rack switch as a single point of failure, and how the dual-plane design mitigates hash polarization and improves path selection for elephant flows. Gain insights from real-world deployment experiences and operational lessons learned from running HPN in production.
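The central traffic argument above is that LLM training produces a few very large flows rather than many small ones, which makes per-flow ECMP hashing balance load poorly and motivates HPN's dual-plane design and path selection. The following Python sketch is not from the talk; all flow counts, link counts, and trial numbers are illustrative assumptions. It simulates hashing the same aggregate traffic onto equal-cost uplinks, once as many small flows and once as a few elephant flows, and reports how unevenly the load lands in each case.

import random

def ecmp_imbalance(num_flows, num_links=8, trials=1000, seed=0):
    """Average ratio of busiest-link load to mean link load when `num_flows`
    equal-sized flows are hashed uniformly onto `num_links` uplinks."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        loads = [0] * num_links
        for _ in range(num_flows):
            # Stand-in for a per-flow 5-tuple ECMP hash: each flow sticks to one link.
            loads[rng.randrange(num_links)] += 1
        total += max(loads) / (sum(loads) / num_links)
    return total / trials

# Same aggregate traffic, split into many small flows vs. a few elephants:
print(f"3200 small flows:   busiest link ~{ecmp_imbalance(3200):.2f}x the average")
print(f"   8 elephant flows: busiest link ~{ecmp_imbalance(8):.2f}x the average")

With many small flows the busiest uplink stays close to the average, while with a handful of elephant flows hash collisions routinely leave one link carrying a large multiple of its fair share, which is the load-imbalance problem the presentation's architecture aims to address.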
Syllabus
Alibaba HPN: A Data Center Network for Large Language Model Training
Taught by
Open Compute Project