Learn about the orchestration and operations requirements of large-scale AI clusters in this 15-minute technical talk from Supermicro and Broadcom experts. Explore real-world examples and solutions that use SONiC to manage thousands of switches and tens of thousands of links. Discover key design considerations, including accelerator vendors, InfiniBand versus Ethernet fabrics, templated scale unit designs, and switch and adapter orchestration. Follow the process of translating high-level requirements into practical designs, automating Day 0 and Day 1 deployments, validating implementations, and setting up Day 2 monitoring. Gain insights into preventing configuration drift, using telemetry for performance optimization, and managing multi-tenant environments effectively.