Alibaba HPN - Data Center Network Architecture for Large Language Model Training

Open Compute Project via YouTube

Overview

Learn about Alibaba's High Performance Network (HPN) architecture in this technical presentation that explores innovative solutions for Large Language Model (LLM) training infrastructure. Discover how traditional data center networks fall short for LLM training workloads, which generate fewer but much larger data flows compared to general cloud computing. Explore the unique 2-tier, dual-plane architecture that can connect 15,000 GPUs in a single Pod, improving upon conventional 3-tier Clos designs. Examine how the dual-ToR implementation enhances reliability by eliminating single points of failure, while the architecture's design prevents hash polarization and optimizes path selection for managing elephant flows. Gain valuable insights from real-world deployment experiences and operational lessons learned from implementing HPN in production environments.
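To make the point about fewer but much larger flows concrete, here is a minimal Python sketch of why hash-based multipath forwarding tends to balance poorly when traffic consists of a handful of elephant flows. It is not taken from the presentation and does not model HPN itself: the link count, flow counts, flow rates, the SHA-256 stand-in for a switch hash, and the helper names `ecmp_link`, `link_loads`, and `imbalance` are all assumptions chosen for illustration.

```python
import hashlib

NUM_LINKS = 8  # hypothetical number of equal-cost uplinks


def ecmp_link(flow_id, num_links=NUM_LINKS):
    """Map a flow to an uplink by hashing its identifier (stand-in for a 5-tuple hash)."""
    digest = hashlib.sha256(flow_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_links


def link_loads(flows, num_links=NUM_LINKS):
    """Sum the offered load (Gbps) that lands on each uplink."""
    loads = [0.0] * num_links
    for flow_id, gbps in flows.items():
        loads[ecmp_link(flow_id, num_links)] += gbps
    return loads


def imbalance(loads):
    """Ratio of the busiest link's load to the mean load (1.0 is perfectly even)."""
    return max(loads) / (sum(loads) / len(loads))


# Same 800 Gbps total, split two ways (all figures are made up for illustration).
elephants = {f"elephant-{i}": 200.0 for i in range(4)}  # few, very large flows
mice = {f"mouse-{i}": 0.2 for i in range(4000)}         # many small flows

elephant_loads = link_loads(elephants)
mice_loads = link_loads(mice)

print("elephant imbalance:", round(imbalance(elephant_loads), 2))
print("mice imbalance:    ", round(imbalance(mice_loads), 2))
```

With four flows spread over eight links, at least four links stay idle and the busiest link carries at least twice the mean load, whichever hash is used. The many-small-flows case averages out far more evenly. This is the kind of imbalance that path selection for elephant flows has to manage.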

Syllabus

Alibaba HPN: A Data Center Network for Large Language Model Training

Taught by

Open Compute Project
