Explore the development and future of multi-GPU inference in vLLM in this Ray Summit 2024 conference talk by Sangbin Cho of Anyscale and Murali Andoorveedu of CentML. Gain insight into how distributed inference for large language models differs from distributed training, and learn the key parallelism strategies, including tensor, pipeline, and expert parallelism, with detailed explanations of their mechanisms. Through a practical vLLM case study, discover how to build optimized architectures for efficient distributed inference across multiple GPUs, and come away with an understanding of the current state and future trajectory of scaling LLM inference, a crucial aspect of AI infrastructure development.
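For a concrete flavor of the tensor and pipeline parallelism the talk covers, here is a minimal sketch using vLLM's offline inference API. The model name and parallelism degrees are illustrative assumptions, not taken from the talk; adjust them to your hardware.

```python
# Minimal multi-GPU inference sketch with vLLM's offline API.
# Assumes vLLM is installed and the machine has
# tensor_parallel_size * pipeline_parallel_size GPUs available.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # example model (assumption)
    tensor_parallel_size=4,    # shard each layer's weights across 4 GPUs
    pipeline_parallel_size=2,  # split the layer stack into 2 pipeline stages
)

params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["What is tensor parallelism?"], params)
for out in outputs:
    print(out.outputs[0].text)
```

With this setup, tensor parallelism splits each weight matrix across GPUs within a stage, while pipeline parallelism assigns contiguous groups of layers to different GPUs, trading per-token communication for inter-stage activation transfers.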