Load-Aware GPU Fractioning for LLM Inference on Kubernetes
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Learn about optimizing GPU resource allocation for Large Language Model (LLM) inference on Kubernetes in this technical conference talk from IBM researchers. Explore the challenges of efficient GPU utilization and discover an analytical approach to understanding the relationship between request loads and resource requirements. Examine how GPU compute and memory requirements for LLM inference servers like vLLM correlate with configuration parameters and key performance metrics. Master the implementation of optimal GPU fractioning at deployment time based on model characteristics and estimated workloads. Watch a demonstration of an open-source controller that automatically converts whole GPU requests into fractional requests using MIG (Multi-Instance GPU) slices, enabling improved resource density and sustainability while maintaining service level objectives.
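To make the controller's behavior concrete, here is a minimal sketch in Go of the idea the talk describes: rewriting a whole-GPU request into the smallest MIG slice that fits an estimated memory need. The profile names follow NVIDIA's published A100 80GB MIG geometry; the memory-estimation formula, function names, and thresholds are illustrative assumptions, not the controller's actual implementation.

```go
package main

import "fmt"

// Hypothetical MIG profiles on an NVIDIA A100 80GB, keyed by their Kubernetes
// extended-resource names and the GPU memory each slice exposes.
var migProfiles = []struct {
	resource string
	memGiB   int
}{
	{"nvidia.com/mig-1g.10gb", 10},
	{"nvidia.com/mig-2g.20gb", 20},
	{"nvidia.com/mig-3g.40gb", 40},
	{"nvidia.com/mig-7g.80gb", 80},
}

// estimateMemGiB stands in for the analytical model discussed in the talk:
// model weights plus a KV-cache term sized from the expected request load.
// The constants here are toy values, not the paper's coefficients.
func estimateMemGiB(paramsBillions float64, maxBatchTokens int) int {
	weights := paramsBillions * 2                // ~2 bytes per fp16 parameter
	kvCache := float64(maxBatchTokens) * 0.0005  // toy per-token KV-cache cost
	return int(weights+kvCache) + 1              // round up with a little headroom
}

// fractionRequest rewrites a whole-GPU request ("nvidia.com/gpu": 1) into the
// smallest MIG slice that satisfies the estimated memory need, mimicking the
// deployment-time conversion the controller is described as performing.
func fractionRequest(requests map[string]int, needGiB int) map[string]int {
	if requests["nvidia.com/gpu"] == 0 {
		return requests // nothing to convert
	}
	out := map[string]int{}
	for k, v := range requests {
		if k != "nvidia.com/gpu" {
			out[k] = v
		}
	}
	for _, p := range migProfiles {
		if p.memGiB >= needGiB {
			out[p.resource] = 1
			return out
		}
	}
	// Workload needs more than any single slice offers: keep the whole GPU.
	out["nvidia.com/gpu"] = requests["nvidia.com/gpu"]
	return out
}

func main() {
	need := estimateMemGiB(7, 8192) // e.g. a 7B-parameter model, modest batch
	reqs := map[string]int{"nvidia.com/gpu": 1, "cpu": 8}
	fmt.Println("estimated GiB:", need)
	fmt.Println("rewritten requests:", fractionRequest(reqs, need))
}
```

Running this prints an estimate of roughly 19 GiB for the example workload and rewrites the request to a single `nvidia.com/mig-2g.20gb` slice, freeing the remaining capacity of the GPU for other inference servers.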
Syllabus
Load-Aware GPU Fractioning for LLM Inference on Kubernetes - Olivier Tardieu & Yue Zhu, IBM
Taught by
CNCF [Cloud Native Computing Foundation]