
The Evolution of Multi-GPU Inference in vLLM

Anyscale via YouTube

Overview

Explore the development and future of multi-GPU inference in vLLM through this Ray Summit 2024 conference talk delivered by Sangbin Cho from Anyscale and Murali Andoorveedu from CentML. Gain insight into how distributed inference for large language models differs from distributed training, and learn about key parallelism strategies, including tensor, pipeline, and expert parallelism, with detailed explanations of how each works. Through a practical vLLM case study, discover how to build architectures optimized for efficient distributed inference across multiple GPUs, and come away with a clear picture of the current state and future trajectory of scaling LLM inference.
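
For readers who want to try the parallelism strategies the talk covers, here is a minimal sketch of tensor-parallel inference using vLLM's offline LLM API; the model name and GPU count are illustrative assumptions, not details taken from the talk.

    from vllm import LLM, SamplingParams

    # Shard each layer's weight matrices across 4 GPUs (tensor parallelism).
    # Model name and GPU count are assumptions for illustration only.
    llm = LLM(
        model="meta-llama/Meta-Llama-3-70B-Instruct",
        tensor_parallel_size=4,
        # pipeline_parallel_size=2,  # optionally also split layers into stages
    )

    params = SamplingParams(temperature=0.7, max_tokens=128)
    outputs = llm.generate(["Explain tensor parallelism in one sentence."], params)
    for out in outputs:
        print(out.outputs[0].text)

Tensor parallelism splits each weight matrix across GPUs and is the common choice within a single node, while pipeline parallelism assigns contiguous groups of layers to different GPUs or nodes; the two can be combined, as the commented-out parameter suggests.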

Syllabus

The Evolution of Multi-GPU Inference in vLLM | Ray Summit 2024

Taught by

Anyscale

