Load-Aware GPU Fractioning for LLM Inference on Kubernetes
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Learn about optimizing GPU resource allocation for Large Language Model (LLM) inference on Kubernetes in this technical conference talk from IBM researchers. Explore the challenges of efficient GPU utilization and discover an analytical approach to understanding the relationship between request loads and resource requirements. Examine how GPU compute and memory requirements for LLM inference servers like vLLM correlate with configuration parameters and key performance metrics. Master the implementation of optimal GPU fractioning at deployment time based on model characteristics and estimated workloads. Watch a demonstration of an open-source controller that automatically converts whole GPU requests into fractional requests using MIG (Multi-Instance GPU) slices, enabling improved resource density and sustainability while maintaining service level objectives.
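To make the controller's behavior concrete, here is a minimal sketch in Go of the idea the talk describes: rewriting a whole-GPU request into the smallest MIG slice that fits an estimated memory need. The profile names follow NVIDIA's published A100 80GB MIG geometry; the memory-estimation formula, function names, and thresholds are illustrative assumptions, not the controller's actual implementation.

```go
package main

import "fmt"

// Hypothetical MIG profiles on an NVIDIA A100 80GB, keyed by their Kubernetes
// extended-resource names and the GPU memory each slice exposes.
var migProfiles = []struct {
	resource string
	memGiB   int
}{
	{"nvidia.com/mig-1g.10gb", 10},
	{"nvidia.com/mig-2g.20gb", 20},
	{"nvidia.com/mig-3g.40gb", 40},
	{"nvidia.com/mig-7g.80gb", 80},
}

// estimateMemGiB stands in for the analytical model discussed in the talk:
// model weights plus a KV-cache term sized from the expected request load.
// The constants here are toy values, not the paper's coefficients.
func estimateMemGiB(paramsBillions float64, maxBatchTokens int) int {
	weights := paramsBillions * 2                // ~2 bytes per fp16 parameter
	kvCache := float64(maxBatchTokens) * 0.0005  // toy per-token KV-cache cost
	return int(weights+kvCache) + 1              // round up with a little headroom
}

// fractionRequest rewrites a whole-GPU request ("nvidia.com/gpu": 1) into the
// smallest MIG slice that satisfies the estimated memory need, mimicking the
// deployment-time conversion the controller is described as performing.
func fractionRequest(requests map[string]int, needGiB int) map[string]int {
	if requests["nvidia.com/gpu"] == 0 {
		return requests // nothing to convert
	}
	out := map[string]int{}
	for k, v := range requests {
		if k != "nvidia.com/gpu" {
			out[k] = v
		}
	}
	for _, p := range migProfiles {
		if p.memGiB >= needGiB {
			out[p.resource] = 1
			return out
		}
	}
	// Workload needs more than any single slice offers: keep the whole GPU.
	out["nvidia.com/gpu"] = requests["nvidia.com/gpu"]
	return out
}

func main() {
	need := estimateMemGiB(7, 8192) // e.g. a 7B-parameter model, modest batch
	reqs := map[string]int{"nvidia.com/gpu": 1, "cpu": 8}
	fmt.Println("estimated GiB:", need)
	fmt.Println("rewritten requests:", fractionRequest(reqs, need))
}
```

Running this prints an estimate of roughly 19 GiB for the example workload and rewrites the request to a single `nvidia.com/mig-2g.20gb` slice, freeing the remaining capacity of the GPU for other inference servers.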
Syllabus
Load-Aware GPU Fractioning for LLM Inference on Kubernetes - Olivier Tardieu & Yue Zhu, IBM
Taught by
CNCF [Cloud Native Computing Foundation]