Telemetry-Based Load Balancing of AI/ML Workloads in Self-Healing Networks
Open Compute Project via YouTube
Overview
Learn how Tencent implemented a self-healing network for AI/ML workloads in this 19-minute technical presentation from Broadcom experts. Explore the challenges specific to AI/ML network traffic: compared with traditional workloads, it consists of fewer flows, each consuming significant bandwidth, so links saturate quickly, and it requires a lossless fabric with low latency. Discover how Ethernet-based technologies and the SAI/SONiC ecosystem are used alongside Broadcom's networking solutions to maintain performance. Gain insight into the implementation of in-band telemetry and packet drop monitoring, and see how applications use granular network telemetry data to dynamically optimize load balancing for AI/ML workload flows.
Syllabus
Telemetry-based load balancing of AI/ML workloads
Taught by
Open Compute Project