Network Stability Challenges and Solutions in Large-Scale Computing Clusters
Open Compute Project via YouTube
Overview
Learn how Alibaba maintains network stability in large-scale computing clusters in this technical presentation on advanced monitoring and control techniques. Examine the unique challenges posed by AI/ML training synchronization operators and see how the 'DC as a computer' philosophy guides the implementation of unified monitoring systems. Explore high-precision flow monitoring that produces per-flow rate statistics at sub-millisecond granularity, used to identify micro-congestion points and optimize congestion control. Understand how Alternate Marking DSCP (A.M.D) technology detects and locates packet loss events across the network with precision on the order of seconds.
Syllabus
The Challenges and Practices of Network Stability in Alibaba's Large-Scale Computing Clusters
Taught by
Open Compute Project