New Approaches to Network Telemetry for AI Performance Optimization
Open Compute Project via YouTube
Overview
Learn how to optimize large GPU clusters for machine learning workloads in this 11-minute conference talk by a Principal Software Research Architect at NVIDIA. Explore why traditional data center telemetry approaches fall short for massive ML models and discover new methods for extracting meaningful metrics from large-scale clusters. Examine how ML workloads create distinctive patterns of similarity and synchronicity across adaptive-routed, rail-optimized, fat-tree topologies, and understand the specialized abstractions developed to identify performance optimization opportunities in ML-focused infrastructure.
Syllabus
New Approaches to Network Telemetry: Essential for AI Performance
Taught by
Open Compute Project