

Load-Aware GPU Fractioning for LLM Inference on Kubernetes

CNCF [Cloud Native Computing Foundation] via YouTube

Overview

Learn about optimizing GPU resource allocation for Large Language Model (LLM) inference on Kubernetes in this technical conference talk from IBM researchers. Explore the challenges of efficient GPU utilization and discover an analytical approach to understanding the relationship between request loads and resource requirements. Examine how GPU compute and memory requirements for LLM inference servers like vLLM correlate with configuration parameters and key performance metrics. Master the implementation of optimal GPU fractioning at deployment time based on model characteristics and estimated workloads. Watch a demonstration of an open-source controller that automatically converts whole GPU requests into fractional requests using MIG (Multi-Instance GPU) slices, enabling improved resource density and sustainability while maintaining service level objectives.
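The talk does not publish its controller's code on this page, but the core idea can be sketched: estimate the GPU memory a vLLM server will need from the model's characteristics and the expected request load, then map a whole-GPU request onto the smallest MIG slice that fits. The sketch below is illustrative only; the MIG profile table, the fp16 memory formula, and the 10% headroom factor are assumptions for this example, not the speakers' actual sizing model.

```go
package main

import (
	"fmt"
	"sort"
)

// MIG slices on an A100-80GB, using the resource names the NVIDIA
// device plugin exposes to Kubernetes (e.g. nvidia.com/mig-1g.10gb).
type migProfile struct {
	resourceName string
	memoryGiB    float64
}

var a100Profiles = []migProfile{
	{"nvidia.com/mig-1g.10gb", 10},
	{"nvidia.com/mig-2g.20gb", 20},
	{"nvidia.com/mig-3g.40gb", 40},
	{"nvidia.com/mig-4g.40gb", 40},
	{"nvidia.com/mig-7g.80gb", 80},
}

// estimateMemoryGiB approximates what an LLM inference server needs:
// model weights plus KV cache for the expected concurrent load,
// assuming fp16 (2 bytes per parameter / per cache element).
func estimateMemoryGiB(paramsB float64, layers, hidden, maxSeqLen, concurrency int) float64 {
	const bytesFP16 = 2.0
	weights := paramsB * 1e9 * bytesFP16
	// KV cache: 2 tensors (K and V) per layer, per token, per request.
	kvCache := 2.0 * float64(layers) * float64(hidden) * bytesFP16 *
		float64(maxSeqLen) * float64(concurrency)
	return (weights + kvCache) / (1 << 30)
}

// pickProfile returns the smallest MIG slice whose memory covers the
// estimate, with headroom for activations and fragmentation.
func pickProfile(requiredGiB float64) (migProfile, bool) {
	sort.Slice(a100Profiles, func(i, j int) bool {
		return a100Profiles[i].memoryGiB < a100Profiles[j].memoryGiB
	})
	const headroom = 1.1 // assumed safety margin
	for _, p := range a100Profiles {
		if p.memoryGiB >= requiredGiB*headroom {
			return p, true
		}
	}
	return migProfile{}, false // fits no slice; keep the whole GPU
}

func main() {
	// Example: a 7B-parameter model (32 layers, hidden size 4096)
	// serving ~8 concurrent requests of up to 2048 tokens each.
	need := estimateMemoryGiB(7, 32, 4096, 2048, 8)
	if p, ok := pickProfile(need); ok {
		fmt.Printf("need %.1f GiB -> request %s instead of nvidia.com/gpu\n",
			need, p.resourceName)
	}
}
```

In a controller like the one demonstrated, a decision of this kind would be applied at deployment time by rewriting the pod's `nvidia.com/gpu` resource request to the chosen MIG resource name, packing more inference servers onto each physical GPU while still meeting service level objectives.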

Syllabus

Load-Aware GPU Fractioning for LLM Inference on Kubernetes - Olivier Tardieu & Yue Zhu, IBM

Taught by

CNCF [Cloud Native Computing Foundation]
