New Approaches to Network Telemetry for AI Performance Optimization

Open Compute Project via YouTube

Overview

Learn how to optimize large GPU clusters for machine learning workloads in this 11-minute conference talk from NVIDIA's Principal Software Research Architect. Explore why traditional data center telemetry approaches fall short for massive ML models and discover new methods for extracting meaningful metrics from large-scale clusters. Examine how ML workloads create unique patterns of similarity and synchronicity across adaptive-routed, rail-optimized, fat-tree topologies, and understand the specialized abstractions developed to identify performance optimization opportunities in ML-focused infrastructure.

Syllabus

New approaches to network telemetry essential for AI performance

Taught by

Open Compute Project
