Enabling Fault Tolerance for GPU Accelerated AI Workloads in Kubernetes
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Explore fault tolerance strategies for GPU-accelerated AI workloads in Kubernetes through this conference talk from NVIDIA experts. Learn how to address common hardware failures in K8s-based ML platforms, including GPU malfunctions, network disruptions, ECC errors, and out-of-memory (OOM) events, all of which lead to resource underutilization and increased operational costs. Discover fault tolerance approaches ranging from basic job restart mechanisms to more sophisticated solutions using hot spares and faulty node replacement. Gain insights into implementing fault propagation through node and pod conditions, and understand the current limitations of the Kubernetes ecosystem for fault detection. Learn techniques for enhancing components such as the node-problem-detector and for building new solutions covering fault detection, propagation, reaction, and remediation in AI/ML workloads.
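To make the "fault propagation through node conditions" idea concrete, below is a minimal, hedged Go sketch of a health probe that could run as a DaemonSet sidecar and publish a custom node condition (similar in spirit to what the node-problem-detector does). The condition name GpuHealthy, the nvidia-smi exit-code check, and the NODE_NAME environment variable are illustrative assumptions, not the speakers' actual implementation.

```go
// Illustrative sketch: publish a custom "GpuHealthy" node condition so that
// schedulers and remediation controllers can react to GPU faults.
// Assumes it runs in-cluster with NODE_NAME injected via the downward API
// and a service account allowed to patch nodes/status.
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"os"
	"os/exec"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	nodeName := os.Getenv("NODE_NAME")

	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Crude health probe (assumption for illustration): treat a non-zero
	// exit code from nvidia-smi as a GPU fault on this node.
	status := corev1.ConditionTrue
	reason := "NvidiaSmiSucceeded"
	if err := exec.Command("nvidia-smi").Run(); err != nil {
		status = corev1.ConditionFalse
		reason = "NvidiaSmiFailed"
	}

	// Strategic-merge patch of status.conditions; conditions merge by "type",
	// so this adds or updates only the GpuHealthy entry.
	patch := map[string]interface{}{
		"status": map[string]interface{}{
			"conditions": []corev1.NodeCondition{{
				Type:               corev1.NodeConditionType("GpuHealthy"),
				Status:             status,
				Reason:             reason,
				Message:            "set by gpu-health probe (illustrative)",
				LastHeartbeatTime:  metav1.Now(),
				LastTransitionTime: metav1.Now(),
			}},
		},
	}
	data, _ := json.Marshal(patch)
	if _, err := client.CoreV1().Nodes().PatchStatus(context.TODO(), nodeName, data); err != nil {
		panic(err)
	}
	fmt.Printf("node %s: GpuHealthy=%s (%s)\n", nodeName, status, reason)
}
```

A controller or the cluster autoscaler could then watch for GpuHealthy=False to cordon the node, swap in a hot spare, or trigger node replacement, which is the reaction and remediation side of the pipeline the talk describes.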
Syllabus
Enabling Fault Tolerance for GPU Accelerated AI Workloads in Kubernetes - A. Singh & A. Paithankar
Taught by
CNCF [Cloud Native Computing Foundation]