New Approaches to Network Telemetry for AI Performance Optimization
Open Compute Project via YouTube
Overview
Learn how to optimize large GPU clusters for machine learning workloads in this 11-minute conference talk from Nvidia's Principal Software Research Architect. Explore why traditional data center telemetry approaches fall short for massive ML models, and discover new methods for extracting meaningful metrics from large-scale clusters. Examine how ML workloads create unique patterns of similarity and synchronicity across adaptively routed, rail-optimized fat-tree topologies, and understand the specialized abstractions developed to identify performance optimization opportunities in ML-focused infrastructure.
Syllabus
New Approaches to Network Telemetry Essential for AI Performance
Taught by
Open Compute Project