Leveraging Open Technologies to Monitor Packet Drops in AI Cluster Fabrics

Open Compute Project via YouTube

Overview

In this technical talk, eBay's Director of Site Network Engineering explains how to monitor and troubleshoot packet drops in AI cluster networks. The talk explains why lossless networks matter for AI cluster performance and job completion times, then walks through the development of a Telemetry and Monitoring (TAM) solution built on the Open Compute Project's Switch Abstraction Interface (SAI) and open sFlow drop notification technologies. It shows how monitoring tools can capture packet drops, generate notifications, identify drop reasons, and locate congestion points. It also draws on eBay's experience running open networking hardware and community SONiC in its data centers, and shares best practices for tuning switches, NICs, and GPU servers to maintain network performance.

Syllabus

Leveraging open technologies to monitor packet drops in AI cluster fabrics

Taught by

Open Compute Project

