Mastering GPU Management in Kubernetes Using the Operator Pattern

Overview

This 48-minute conference talk by Shiva Krishna Merla and Kevin Klues of NVIDIA covers GPU management in Kubernetes using the operator pattern. It explains how Kubernetes has evolved into a platform for supporting the lifecycle of AI and ML workloads, particularly large language models (LLMs), and walks through the four phases of managing GPUs in a cluster: software stack installation, infrastructure expansion, lifecycle management, and monitoring. The speakers show how the operator pattern can be used to manage the GPU software lifecycle, and a demonstration of the NVIDIA GPU Operator illustrates how this approach benefits Kubernetes administrators, from basic driver installation to managing advanced AI/ML use cases.