Building Resilience for Large-Scale AI Training: GPU Management, Failure Detection, and Beyond

CNCF [Cloud Native Computing Foundation] via YouTube

Overview

This conference talk from KubeCon + CloudNativeCon covers strategies for managing GPU failures and building resilient AI training systems. It addresses the challenges that hardware failures pose when AI training scales across thousands of GPUs and hundreds of machines. The speakers cover GPU fault detection, network performance monitoring, and proactive problem identification with tools such as NVIDIA DCGM, along with fault-tolerant distributed training principles that help minimize the impact of GPU failures. Drawing on their real-world experience in cloud computing and large language model training, they present best practices for identifying, remediating, and preventing GPU failures that can otherwise lead to increased costs and development delays. The talk also explains why even minor performance degradation can significantly affect large-scale training jobs and how proper observability helps maintain training efficiency.
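The points about scale and minor degradation can be illustrated with some simple arithmetic. The sketch below is not taken from the talk. It assumes independent GPU failures and synchronous training in which every step waits for the slowest worker. The function names and all numbers are hypothetical, chosen only for illustration.

```python
# Illustrative sketch only: the failure rate and step times below are
# made-up inputs, not measurements from the talk.

def prob_any_failure(per_gpu_daily_failure_rate: float,
                     num_gpus: int,
                     days: float) -> float:
    """Probability that at least one GPU fails during a job.

    Assumes each GPU fails independently with the same daily rate.
    """
    p_one_gpu_survives = (1.0 - per_gpu_daily_failure_rate) ** days
    return 1.0 - p_one_gpu_survives ** num_gpus


def synchronous_step_time(worker_step_times: list[float]) -> float:
    """Step time when every worker must finish before the next step starts.

    Under this assumption the slowest worker sets the pace for all.
    """
    return max(worker_step_times)


def throughput_loss(healthy_step_time: float, degraded_step_time: float) -> float:
    """Fraction of throughput lost when the step time grows."""
    return 1.0 - healthy_step_time / degraded_step_time


if __name__ == "__main__":
    # Hypothetical: 0.1% daily failure chance per GPU, 1,000 GPUs, 1 day.
    print(f"P(at least one failure): {prob_any_failure(0.001, 1000, 1):.3f}")

    # Hypothetical: 999 healthy workers at 1.0 s, one degraded worker at 1.2 s.
    times = [1.0] * 999 + [1.2]
    step = synchronous_step_time(times)
    print(f"Step time: {step:.2f} s, "
          f"throughput loss: {throughput_loss(1.0, step):.1%}")
```

Under these assumptions, a single worker that is 20% slower delays every step for the whole job. This is one way to see why per-GPU observability matters at scale.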

Syllabus

Building Resilience for Large-Scale AI Training: GPU Man... - Ganeshkumar Ashokavardhanan & Ace Eldeib

Taught by

CNCF [Cloud Native Computing Foundation]
