Orchestration Needs for AI Clusters at Scale - Lessons Learned from Two Leading Providers

Open Compute Project via YouTube

Overview

Learn about the orchestration and operations requirements of large-scale AI clusters in this 15-minute technical talk from Supermicro and Broadcom experts. Explore real-world examples and solutions that use SONiC to manage thousands of switches and tens of thousands of links. Discover key design considerations, including the choice of accelerator vendor, InfiniBand versus Ethernet fabrics, templated scale-unit designs, and switch and adapter orchestration. Follow the process of translating high-level requirements into practical designs, automating Day 0 and Day 1 deployments, validating the result, and setting up Day 2 monitoring. Gain insights into preventing configuration drift, using telemetry for performance optimization, and managing multi-tenant environments.

Syllabus

Orchestration needs for AI clusters at scale – Lessons learned from two leading providers

Taught by

Open Compute Project
