Best Practices for Deploying LLM Inference, RAG and Fine-Tuning Pipelines on Kubernetes
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Learn how to effectively deploy, scale, and manage Large Language Model (LLM) inference pipelines on Kubernetes in this technical conference talk from NVIDIA experts. Discover essential best practices for implementing common patterns including inference, retrieval-augmented generation (RAG), and fine-tuning workflows. Master techniques for reducing inference latency through model caching, optimizing GPU resource utilization with efficient scheduling strategies, handling multi-GPU/node configurations, and implementing auto-quantization. Explore methods for enhancing security through Role-Based Access Control (RBAC), setting up comprehensive monitoring, configuring auto-scaling, and supporting air-gapped cluster deployments. Follow demonstrations of building flexible pipelines using both a lightweight standalone operator-pattern tool and KServe, an open-source AI inference platform. Gain practical knowledge for post-deployment management to improve the performance, efficiency, and security of LLM deployments in Kubernetes environments.
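For illustration only (not taken from the talk's materials): a minimal sketch of how an LLM inference service of the kind described above might be deployed on Kubernetes with KServe, using the official Kubernetes Python client. The model URI, namespace, replica counts, and resource sizes are placeholder assumptions, and the cluster is assumed to already have KServe and the NVIDIA GPU device plugin installed.

# Minimal sketch: create a KServe InferenceService for an LLM via the
# Kubernetes Python client. All names and sizes below are illustrative.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running inside a pod

inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "llm-demo", "namespace": "default"},
    "spec": {
        "predictor": {
            "minReplicas": 1,   # autoscaling lower bound
            "maxReplicas": 4,   # autoscaling upper bound
            "model": {
                "modelFormat": {"name": "huggingface"},
                # Placeholder model reference, not one used in the talk
                "storageUri": "hf://meta-llama/Llama-2-7b-chat-hf",
                "resources": {
                    "requests": {"nvidia.com/gpu": "1", "memory": "24Gi"},
                    "limits": {"nvidia.com/gpu": "1", "memory": "24Gi"},
                },
            },
        }
    },
}

# InferenceService is a custom resource, so it is created through the
# generic custom-objects API rather than a typed client.
client.CustomObjectsApi().create_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="default",
    plural="inferenceservices",
    body=inference_service,
)

The minReplicas/maxReplicas fields drive KServe's autoscaling of the predictor, and the nvidia.com/gpu request is what lets the Kubernetes scheduler place the replicas on GPU nodes, two of the resource-utilization concerns the talk addresses.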
Syllabus
Best Practices for Deploying LLM Inference, RAG and Fine-Tuning Pipelines on Kubernetes - Meenakshi Kaushik & Shiva Krishna Merla
Taught by
CNCF [Cloud Native Computing Foundation]