Distributed Multi-Node Model Inference Using the LeaderWorkerSet API
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Learn about multi-node model inference deployment in this technical conference talk from KubeCon, which explores the LeaderWorkerSet API for serving large language models across distributed systems. Dive into the challenges of deploying computationally intensive LLMs such as Gemini, Claude, and GPT-4, which are too large to fit on a single GPU or TPU device and therefore require multi-node serving. Explore how the new Kubernetes API orchestrates state-of-the-art model servers, including vLLM and JetStream, on both GPU and TPU infrastructure. Master practical approaches to coordinating distributed processes across multiple nodes while maximizing accelerator memory utilization for better model performance and response times.
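For context, a LeaderWorkerSet groups one leader pod with a set of worker pods that are created, scaled, and restarted together as a unit, which is what makes it a fit for distributed model servers. The manifest below is a minimal sketch of such a deployment, assuming the leaderworkerset.x-k8s.io/v1 API from the kubernetes-sigs/lws project; the vLLM image, startup scripts, group size, and GPU counts are illustrative placeholders, not values taken from the talk.

```yaml
# Minimal sketch of a LeaderWorkerSet for multi-node model serving.
# Assumes the leaderworkerset.x-k8s.io/v1 API (kubernetes-sigs/lws);
# image, commands, size, and GPU counts are placeholders.
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm-multinode
spec:
  replicas: 2                   # two independent leader+worker groups
  leaderWorkerTemplate:
    size: 4                     # pods per group: 1 leader + 3 workers
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      spec:
        containers:
        - name: vllm-leader
          image: vllm/vllm-openai:latest        # placeholder image
          command: ["sh", "-c", "start-leader.sh"]  # placeholder script
          resources:
            limits:
              nvidia.com/gpu: "8"
    workerTemplate:
      spec:
        containers:
        - name: vllm-worker
          image: vllm/vllm-openai:latest        # placeholder image
          command: ["sh", "-c", "start-worker.sh"]  # placeholder script
          resources:
            limits:
              nvidia.com/gpu: "8"
```

Scaling spec.replicas adds or removes whole leader-plus-worker groups rather than individual pods, and with a group restart policy a single pod failure recreates its entire group, which matters when a distributed inference job cannot survive losing one shard of the model.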
Syllabus
Distributed Multi-Node Model Inference Using the LeaderWorkerSet API - Abdullah Gharaibeh, Rupeng Liu
Taught by
CNCF [Cloud Native Computing Foundation]