Enabling Fault Tolerance for GPU Accelerated AI Workloads in Kubernetes

CNCF [Cloud Native Computing Foundation] via YouTube

Overview

Explore fault tolerance strategies for GPU-accelerated AI workloads in Kubernetes through this conference talk from NVIDIA experts. Learn how to address common failures in Kubernetes-based ML platforms, including GPU malfunctions, network disruptions, ECC errors, and OOM events, which lead to resource underutilization and increased operational costs. Discover fault tolerance approaches ranging from basic job restart mechanisms to more sophisticated solutions using hot spares and faulty node replacement. Gain insight into propagating faults through node and pod conditions, and understand the current limitations of the Kubernetes ecosystem for fault detection. Learn techniques for enhancing components such as the node-problem-detector and for building new solutions that cover fault detection, propagation, reaction, and remediation in AI/ML workloads.
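The talk describes propagating faults through node and pod conditions and then reacting to them. As a rough illustration of that idea (not code from the talk), the sketch below uses the official Kubernetes Python client to scan node conditions for a hypothetical GPU fault condition, such as one a node-problem-detector custom plugin might report, and cordons any affected node so new GPU workloads are no longer scheduled there. The condition type `GpuError` and the kubeconfig-based setup are assumptions made for this example.

```python
# Minimal sketch, assuming a custom "GpuError" node condition is reported
# (e.g., by a node-problem-detector custom plugin); not code from the talk.
from kubernetes import client, config


def cordon_nodes_with_gpu_faults(condition_type: str = "GpuError") -> None:
    """Cordon nodes whose status reports the given (hypothetical) fault condition."""
    config.load_kube_config()  # use config.load_incluster_config() when running in a pod
    v1 = client.CoreV1Api()

    for node in v1.list_node().items:
        conditions = node.status.conditions or []
        faulty = any(
            c.type == condition_type and c.status == "True" for c in conditions
        )
        if faulty and not node.spec.unschedulable:
            # Mark the node unschedulable (equivalent to `kubectl cordon`)
            # so the scheduler stops placing new GPU workloads on it.
            v1.patch_node(node.metadata.name, {"spec": {"unschedulable": True}})
            print(f"Cordoned {node.metadata.name} due to {condition_type}")


if __name__ == "__main__":
    cordon_nodes_with_gpu_faults()
```

In a fuller remediation pipeline, cordoning would typically be followed by draining or replacing the faulty node and restarting the affected jobs, along the lines the talk outlines.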

Syllabus

Enabling Fault Tolerance for GPU Accelerated AI Workloads in Kubernetes - A. Singh & A. Paithankar

Taught by

CNCF [Cloud Native Computing Foundation]
