Insights from Production Scheduled Ethernet Fabric in Large AI Training Clusters

Overview

This technical presentation from ByteDance and Broadcom experts examines the implementation of Scheduled Ethernet Fabric in large-scale AI training clusters. It covers the networking challenges of scaling xPU clusters from thousands to potentially millions of units, with a focus on ByteDance's production deployment supporting over 1,000 xPUs. The talk explains how Scheduled Ethernet Fabric addresses four critical performance aspects: congestion management, link failure recovery, load balancing, and noise isolation. It also discusses operational considerations, compatibility issues, and key differences from traditional non-Scheduled fabrics, along with proposed standardization efforts and opportunities for collaboration within the Scheduled Fabric ecosystem.

Syllabus

Insights from Production Scheduled Ethernet Fabric in Large AI Training Clusters

Taught by

Open Compute Project

Related Courses

Scheduled Ethernet Fabric for Large-Scale AI Training Clusters

Congestion Management in Ethernet-Based Networks for AI Cluster Fabric

Next Generation Fabric for AI/ML Use Cases - Disaggregated Scheduled Fabric Using SONiC

High-Performance Ethernet NIC for AI/ML - Thor 2 Architecture and Features

Advancing SONiC for AI - Network Architecture and Performance Optimization

Source Routing for AI Fabrics - Optimizing Network Traffic in Multi-tenant AI Clusters
