Distributed Multi-Node Model Inference Using the LeaderWorkerSet API
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Learn about multi-node model inference deployment in this technical conference talk from KubeCon, which explores the LeaderWorkerSet API for serving large language models across distributed systems. Dive into the challenges of deploying computationally intensive LLMs such as Gemini, Claude, and GPT-4, which are too large to fit on a single GPU or TPU device and therefore require multi-node serving. Explore how the new Kubernetes API orchestrates state-of-the-art model servers, including vLLM and JetStream, on both GPU and TPU infrastructure. Master practical approaches to coordinating distributed processes across multiple nodes while maximizing accelerator memory utilization for better model performance and response times.
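For context, a LeaderWorkerSet groups one leader pod with a set of worker pods that are created, scaled, and restarted together as a unit, which is what makes it a fit for distributed model servers. The manifest below is a minimal sketch of such a deployment, assuming the leaderworkerset.x-k8s.io/v1 API from the kubernetes-sigs/lws project; the vLLM image, startup scripts, group size, and GPU counts are illustrative placeholders, not values taken from the talk.

```yaml
# Minimal sketch of a LeaderWorkerSet for multi-node model serving.
# Assumes the leaderworkerset.x-k8s.io/v1 API (kubernetes-sigs/lws);
# image, commands, size, and GPU counts are placeholders.
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm-multinode
spec:
  replicas: 2                   # two independent leader+worker groups
  leaderWorkerTemplate:
    size: 4                     # pods per group: 1 leader + 3 workers
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      spec:
        containers:
        - name: vllm-leader
          image: vllm/vllm-openai:latest        # placeholder image
          command: ["sh", "-c", "start-leader.sh"]  # placeholder script
          resources:
            limits:
              nvidia.com/gpu: "8"
    workerTemplate:
      spec:
        containers:
        - name: vllm-worker
          image: vllm/vllm-openai:latest        # placeholder image
          command: ["sh", "-c", "start-worker.sh"]  # placeholder script
          resources:
            limits:
              nvidia.com/gpu: "8"
```

Scaling spec.replicas adds or removes whole leader-plus-worker groups rather than individual pods, and with a group restart policy a single pod failure recreates its entire group, which matters when a distributed inference job cannot survive losing one shard of the model.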
Syllabus
Distributed Multi-Node Model Inference Using the LeaderWorkerSet API - Abdullah Gharaibeh, Rupeng Liu
Taught by
CNCF [Cloud Native Computing Foundation]