Navigating Failures in Pods with Devices: Challenges and Solutions
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Explore the complexities of device management in Kubernetes pods in this conference talk, which goes beyond basic CPU and memory allocation. Learn how to handle GPU provisioning, network card management, and specialized device placement requirements, and understand the edge cases that can arise in Kubernetes environments. Gain insights from Node Maintainers about current system limitations, which are particularly relevant for AI/ML workloads that demand sophisticated device configurations. Whether you are new to AI/ML deployments or an experienced practitioner, discover critical considerations for device management, common failure scenarios, and upcoming Kubernetes improvements designed to address these challenges. Help shape future solutions by understanding the current landscape of pod-device interactions and providing feedback on proposed fixes.
Syllabus
Navigating Failures in Pods with Devices: Challenges and Solutions - Sergey Kanzhelev & Mrunal Patel
Taught by
CNCF [Cloud Native Computing Foundation]