Best Practices for Deploying LLM Inference, RAG and Fine-Tuning Pipelines on Kubernetes
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Learn how to effectively deploy, scale, and manage Large Language Model (LLM) inference pipelines on Kubernetes in this technical conference talk from NVIDIA experts. Discover essential best practices for implementing common patterns including inference, retrieval-augmented generation (RAG), and fine-tuning workflows. Master techniques for reducing inference latency through model caching, optimizing GPU resource utilization with efficient scheduling strategies, handling multi-GPU/node configurations, and implementing auto-quantization. Explore methods for enhancing security through Role-Based Access Control (RBAC), setting up comprehensive monitoring, configuring auto-scaling, and supporting air-gapped cluster deployments. Follow demonstrations of building flexible pipelines using both a lightweight standalone operator-pattern tool and KServe, an open-source AI inference platform. Gain practical knowledge for post-deployment management to improve the performance, efficiency, and security of LLM deployments in Kubernetes environments.
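For illustration only (not taken from the talk's materials): a minimal sketch of how an LLM inference service of the kind described above might be deployed on Kubernetes with KServe, using the official Kubernetes Python client. The model URI, namespace, replica counts, and resource sizes are placeholder assumptions, and the cluster is assumed to already have KServe and the NVIDIA GPU device plugin installed.

# Minimal sketch: create a KServe InferenceService for an LLM via the
# Kubernetes Python client. All names and sizes below are illustrative.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running inside a pod

inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "llm-demo", "namespace": "default"},
    "spec": {
        "predictor": {
            "minReplicas": 1,   # autoscaling lower bound
            "maxReplicas": 4,   # autoscaling upper bound
            "model": {
                "modelFormat": {"name": "huggingface"},
                # Placeholder model reference, not one used in the talk
                "storageUri": "hf://meta-llama/Llama-2-7b-chat-hf",
                "resources": {
                    "requests": {"nvidia.com/gpu": "1", "memory": "24Gi"},
                    "limits": {"nvidia.com/gpu": "1", "memory": "24Gi"},
                },
            },
        }
    },
}

# InferenceService is a custom resource, so it is created through the
# generic custom-objects API rather than a typed client.
client.CustomObjectsApi().create_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="default",
    plural="inferenceservices",
    body=inference_service,
)

The minReplicas/maxReplicas fields drive KServe's autoscaling of the predictor, and the nvidia.com/gpu request is what lets the Kubernetes scheduler place the replicas on GPU nodes, two of the resource-utilization concerns the talk addresses.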
Syllabus
Best Practices for Deploying LLM Inference, RAG and Fine-Tuning Pipelines on Kubernetes - Meenakshi Kaushik & Shiva Krishna Merla
Taught by
CNCF [Cloud Native Computing Foundation]