Source Routing for AI Fabrics - Optimizing Network Traffic in Multi-tenant AI Clusters
Open Compute Project via YouTube
Overview
This technical presentation from Marvell's Kishore Atreya and Prathyaya Bhandarkar describes a source routing approach to scheduling AI workloads in Ethernet fabrics. In large-scale, multi-tenant AI clusters, high tail latency and jitter degrade training performance. Because AI training flows are predictable, the proposed framework uses SAI (Switch Abstraction Interface) to predetermine flow paths and program them across access nodes. A software controller engineers the traffic flows between training elements to optimize bandwidth utilization, load, and latency, which the presenters argue reduces network cost and power requirements compared with traditional fabric scheduling. The talk positions this as a simpler way to avoid congestion in AI infrastructure, without the complexity and unpredictable behavior of alternatives such as enhanced congestion control, load balancing, packet spraying, and fabric scheduling.
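As a rough illustration of the idea that a controller can precompute paths when flows are known in advance, the sketch below greedily places each flow on the candidate path whose busiest link is least loaded. It is not the presenters' algorithm and does not use SAI: the function name, the flow and path data, and the link labels are all hypothetical, and the step of programming the chosen paths into access nodes is left out.

```python
from typing import Dict, List, Tuple

Path = Tuple[str, ...]          # a path is an ordered tuple of link names
Flow = Tuple[str, str, float]   # (source node, destination node, demand in Gb/s)


def assign_paths(
    flows: Dict[str, Flow],
    candidate_paths: Dict[Tuple[str, str], List[Path]],
) -> Tuple[Dict[str, Path], Dict[str, float]]:
    """Greedy placement (illustrative assumption, not the talk's method):
    largest flows first, each on the path whose most loaded link is lightest."""
    link_load: Dict[str, float] = {}
    assignment: Dict[str, Path] = {}

    # Place the largest demands first so they get the emptiest paths.
    ordered = sorted(flows.items(), key=lambda item: -item[1][2])
    for flow_id, (src, dst, demand) in ordered:
        paths = candidate_paths[(src, dst)]
        # Score a path by its current bottleneck (most loaded link).
        best = min(
            paths,
            key=lambda p: max(link_load.get(link, 0.0) for link in p),
        )
        for link in best:
            link_load[link] = link_load.get(link, 0.0) + demand
        assignment[flow_id] = best

    return assignment, link_load


# Hypothetical two-spine fabric with two equal-sized training flows.
example_paths = {
    ("leaf1", "leaf2"): [
        ("leaf1-spine1", "spine1-leaf2"),
        ("leaf1-spine2", "spine2-leaf2"),
    ],
}
example_flows = {
    "f1": ("leaf1", "leaf2", 100.0),
    "f2": ("leaf1", "leaf2", 100.0),
}

example_assignment, example_load = assign_paths(example_flows, example_paths)
for flow_id, path in example_assignment.items():
    print(flow_id, "->", " / ".join(path))
```

In this toy fabric the two flows land on different spines, so no link carries more than one flow's demand. A real controller would also have to account for link capacities, tenant isolation, and failures, none of which this sketch models.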
Syllabus
Source Routing for AI Fabrics
Taught by
Open Compute Project