Network Stability Challenges and Solutions in Large-Scale Computing Clusters

Open Compute Project via YouTube

Overview

Learn how Alibaba maintains network stability in large-scale computing clusters in this technical presentation on advanced monitoring and control techniques. It covers the unique challenges posed by the synchronization operators used in AI/ML training and shows how the 'DC as a computer' philosophy guides the implementation of unified monitoring systems. It then explores high-precision flow monitoring, which collects flow-based rate statistics at sub-millisecond granularity to identify micro-congestion points and optimize congestion control. Finally, it explains Alternate Marking DSCP (A.M.D) technology and its role in detecting and locating packet loss events across the network with precision on the order of seconds.

Syllabus

The Challenges and Practices of Network Stability in Alibaba's Large-Scale Computing Clusters

Taught by

Open Compute Project
