New Approaches to Network Telemetry for AI Performance Optimization

Overview

Learn how to optimize large GPU clusters for machine learning workloads in this 11-minute conference talk from NVIDIA's Principal Software Research Architect. Explore why traditional data center telemetry approaches fall short for massive ML models and discover new methods for extracting meaningful metrics from large-scale clusters. Examine how ML workloads create distinctive patterns of similarity and synchronicity across adaptive-routed, rail-optimized, fat-tree topologies, and understand the specialized abstractions developed to identify performance optimization opportunities in ML-focused infrastructure.

Syllabus

New approaches to network telemetry: essential for AI performance

Taught by

Open Compute Project

Related Courses

Applying AI/ML Methods to Diagnose Network Issues Using Telemetry Data

Demystifying AI and ML Infrastructure for Network Engineers

A High-Performance Design, Implementation, Deployment, and Evaluation of The Slim Fly Network

Telemetry-Based Load Balancing of AI/ML Workloads in Self-Healing Networks

Advancing SONiC for AI - Network Architecture and Performance Optimization

Google Cloud Network Infrastructure for AI and Machine Learning
