Scheduled Ethernet Fabric for Large-Scale AI Training Clusters

Overview

Watch a technical presentation in which ByteDance and Broadcom architects explain how Scheduled Ethernet Fabric technology supports large-scale AI training clusters. The talk covers the architecture for connecting tens of thousands of GPUs, including GPU scale-out, multi-tenancy for diverse parallel workloads, and resilience to network failures. It also presents ByteDance's benchmarking results and deployment experience with the fabric, discusses the role of open ecosystems in continued AI infrastructure innovation, and outlines the key requirements for high-performance network fabrics serving massive GPU clusters that run varied AI workloads.

Syllabus

Scheduled Ethernet Fabric for Large-Scale AI Training Clusters

Taught by

Open Compute Project

Related Courses

Insights from Production Scheduled Ethernet Fabric in Large AI Training Clusters

Next Generation Fabric for AI/ML Use Cases - Disaggregated Scheduled Fabric Using SONiC

Congestion Management in Ethernet-Based Networks for AI Cluster Fabric

High-Performance Ethernet NIC for AI/ML - Thor 2 Architecture and Features

Orchestration Needs for AI Clusters at Scale - Lessons Learned from Two Leading Providers

Exploring the Wilderness - Optimizing Ethernet Fabrics for AI Workloads