Orchestration Needs for AI Clusters at Scale - Lessons Learned from Two Leading Providers
Open Compute Project via YouTube
Overview
Learn about orchestration and operations requirements for large-scale AI clusters in this 15-minute technical talk from Supermicro and Broadcom experts. Explore real-world examples and solutions that use SONiC to manage thousands of switches and tens of thousands of links. Discover key considerations, including the choice of accelerator vendor, InfiniBand versus Ethernet fabrics, templated scale-unit designs, and switch and adapter orchestration. Follow the process of translating high-level requirements into practical designs, automating Day 0 and Day 1 deployments, validating the resulting implementation, and setting up Day 2 monitoring. Gain insights into preventing configuration drift, using telemetry for performance optimization, and managing multi-tenant environments.
Syllabus
Orchestration needs for AI clusters at scale – Lessons learned from two leading providers
Taught by
Open Compute Project