

Best Practices for Deploying LLM Inference, RAG and Fine-Tuning Pipelines on Kubernetes

CNCF [Cloud Native Computing Foundation] via YouTube

Overview

Learn how to effectively deploy, scale, and manage Large Language Model (LLM) inference pipelines on Kubernetes in this technical conference talk from NVIDIA experts. Discover essential best practices for implementing common patterns, including inference, retrieval-augmented generation (RAG), and fine-tuning workflows. Master techniques for reducing inference latency through model caching, optimizing GPU utilization with efficient scheduling strategies, handling multi-GPU and multi-node configurations, and applying auto-quantization. Explore methods for enhancing security with Role-Based Access Control (RBAC), setting up comprehensive monitoring, configuring auto-scaling, and supporting air-gapped cluster deployments. Follow demonstrations of building flexible pipelines with both a lightweight standalone operator-pattern tool and KServe, an open-source AI inference platform. Gain practical knowledge of post-deployment management to improve the performance, efficiency, and security of LLM deployments in Kubernetes environments.
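To give a concrete flavor of the kind of deployment the talk discusses, here is a minimal sketch of creating a KServe InferenceService with a GPU request and basic autoscaling bounds, applied through the official Kubernetes Python client. This is not code from the talk: the resource name, namespace, model format, and storageUri are illustrative placeholders, and the exact KServe v1beta1 fields may vary by version.

```python
# Sketch: create a KServe InferenceService for an LLM with one GPU per replica
# and simple replica bounds. Assumes KServe is installed in the cluster and a
# kubeconfig points at it; all names below are placeholders.
from kubernetes import client, config

config.load_kube_config()  # use the current kubeconfig context

inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "llm-demo", "namespace": "llm-serving"},
    "spec": {
        "predictor": {
            "minReplicas": 1,   # keep one replica warm to reduce cold-start latency
            "maxReplicas": 4,   # upper bound for auto-scaling
            "model": {
                "modelFormat": {"name": "huggingface"},
                # Pre-staged model weights (e.g. a PVC-backed cache); replace with your store.
                "storageUri": "pvc://model-cache/example-llm",
                "resources": {
                    "limits": {"nvidia.com/gpu": "1"},  # request one GPU per replica
                    "requests": {"cpu": "4", "memory": "16Gi"},
                },
            },
        }
    },
}

# InferenceService is a custom resource, so it is created via CustomObjectsApi.
api = client.CustomObjectsApi()
api.create_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="llm-serving",
    plural="inferenceservices",
    body=inference_service,
)
print("Created InferenceService llm-demo")
```

Serving the model from a pre-staged storage location rather than downloading weights at pod start is one way to apply the model-caching idea mentioned above; RBAC, monitoring, and air-gapped setups would be layered on top with standard Kubernetes objects.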

Syllabus

Best Practices for Deploying LLM Inference, RAG and Fine-Tuning Pipelines on Kubernetes – Meenakshi Kaushik & Shiva Krishna Merla

Taught by

CNCF [Cloud Native Computing Foundation]

