Explore the development and future of multi-GPU inference in vLLM in this Ray Summit 2024 conference talk by Sangbin Cho of Anyscale and Murali Andoorveedu of CentML. Gain insight into how distributed inference for large language models differs from distributed training, and learn the key parallelism strategies, including tensor, pipeline, and expert parallelism, with detailed explanations of their mechanisms. Through a practical vLLM case study, discover how to build optimized architectures for efficient distributed inference across multiple GPUs, and come away with an understanding of the current state and future trajectory of scaling LLM inference, a crucial aspect of AI infrastructure development.
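For a concrete flavor of the tensor and pipeline parallelism the talk covers, here is a minimal sketch using vLLM's offline inference API. The model name and parallelism degrees are illustrative assumptions, not taken from the talk; adjust them to your hardware.

```python
# Minimal multi-GPU inference sketch with vLLM's offline API.
# Assumes vLLM is installed and the machine has
# tensor_parallel_size * pipeline_parallel_size GPUs available.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # example model (assumption)
    tensor_parallel_size=4,    # shard each layer's weights across 4 GPUs
    pipeline_parallel_size=2,  # split the layer stack into 2 pipeline stages
)

params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["What is tensor parallelism?"], params)
for out in outputs:
    print(out.outputs[0].text)
```

With this setup, tensor parallelism splits each weight matrix across GPUs within a stage, while pipeline parallelism assigns contiguous groups of layers to different GPUs, trading per-token communication for inter-stage activation transfers.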