Optimizing Load Balancing and Autoscaling for Large Language Model (LLM) Inference on Kubernetes

CNCF [Cloud Native Computing Foundation] via YouTube

Overview

This conference talk covers techniques for optimizing Large Language Model (LLM) inference deployments, with a focus on load balancing and autoscaling strategies on Kubernetes. It shows how to integrate the KServe platform for LLM deployment while making efficient use of GPU hardware in both public and private cloud environments. The talk introduces key performance concepts, including the latency-per-token and tokens-per-second metrics, and offers practical guidance on using features of KServe, Knative, and the GPU operator. Test results and analysis are presented to support cost-effective resource management and better resource utilization for business-critical applications built on generative AI language models. The talk also addresses managing compute-intensive workloads and optimizing power usage in Kubernetes environments.
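As a rough illustration of the two metrics named above, the sketch below computes latency per token for a single request and tokens per second across a measurement window. It is a minimal example under stated assumptions, not the method used in the talk: the `RequestTiming` record, its field names, and the sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Hypothetical record of one completed inference request."""
    output_tokens: int       # number of tokens the model generated
    duration_seconds: float  # wall-clock time spent serving the request


def latency_per_token(timing: RequestTiming) -> float:
    """Average seconds spent per generated token for one request."""
    if timing.output_tokens <= 0:
        raise ValueError("request produced no tokens")
    return timing.duration_seconds / timing.output_tokens


def tokens_per_second(timings: list[RequestTiming], window_seconds: float) -> float:
    """Aggregate throughput: tokens generated by all requests in the window."""
    if window_seconds <= 0:
        raise ValueError("window must be positive")
    return sum(t.output_tokens for t in timings) / window_seconds


if __name__ == "__main__":
    # Made-up sample data: two requests completed within a 2-second window.
    sample = [RequestTiming(100, 2.0), RequestTiming(50, 1.0)]
    print(latency_per_token(sample[0]))
    print(tokens_per_second(sample, window_seconds=2.0))
```

Latency per token describes what a single user experiences, while tokens per second describes how much work the deployment completes overall, so the two can move in opposite directions as concurrency on a GPU increases.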

Syllabus

Optimizing Load Balancing and Autoscaling for Large Language Model (LLM) Inference on... David Gray

Taught by

CNCF [Cloud Native Computing Foundation]
