Learn how to monitor and troubleshoot packet drops in AI cluster networks in this technical talk by eBay's Director of Site Network Engineering. Discover why lossless networks matter for AI cluster performance and job completion times. Explore the development of a Telemetry and Monitoring (TAM) solution built on the Open Compute Project's Switch Abstraction Interface (SAI) and open sFlow drop notification technologies. Understand how to implement monitoring tools that capture packet drops, generate notifications, identify drop reasons, and locate congestion points. Gain insights into eBay's experience running open networking hardware and community SONiC in its data centers. Learn best practices for tuning infrastructure components, including switches, NICs, and GPU servers, to maintain network performance.