Learn about Llumnix, a dynamic request scheduling system for large language models, in this technical conference talk from Ray Summit 2024. Explore how Alibaba Group developed this system, built on vLLM and Ray, to tackle major challenges in LLM serving through runtime rescheduling and KV cache migration across instances.

Discover the system's approach to reducing prefill latencies via cross-instance defragmentation and minimizing tail decoding latencies through load balancing and preemption reduction. Follow the research journey from concept to publication at OSDI '24 and its implementation at Alibaba.

Gain practical insights into optimizing LLM serving for high-performance environments, with detailed explanations of how Ray and vLLM are used to address complex scheduling challenges in AI infrastructure. Understand the current capabilities of this open-source project, available on GitHub, and learn about its future development roadmap.
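To make the core idea concrete, here is a minimal, hypothetical sketch of load-aware dispatch plus cross-instance migration, the mechanism the talk describes at the level of KV cache blocks. This is not Llumnix's actual API; the `Instance`, `dispatch`, and `rebalance` names and the block-count load metric are illustrative assumptions only.

```python
from dataclasses import dataclass, field

@dataclass
class Instance:
    """A serving instance tracking its in-flight requests and their KV-cache sizes.

    Hypothetical model for illustration; real instances would hold GPU state.
    """
    name: str
    requests: dict = field(default_factory=dict)  # request_id -> KV cache blocks held

    @property
    def load(self) -> int:
        # Load measured as total KV-cache blocks in use (an assumption).
        return sum(self.requests.values())

def dispatch(instances, request_id, kv_blocks):
    """Place a new request on the least-loaded instance; return its name."""
    target = min(instances, key=lambda inst: inst.load)
    target.requests[request_id] = kv_blocks
    return target.name

def rebalance(instances, threshold=2):
    """Migrate one request from the most- to the least-loaded instance.

    Triggers only when the load gap exceeds `threshold` blocks; moving the
    request's entry here stands in for live KV-cache migration across instances.
    Returns the migrated request id, or None if no migration was needed.
    """
    src = max(instances, key=lambda inst: inst.load)
    dst = min(instances, key=lambda inst: inst.load)
    if src is dst or src.load - dst.load <= threshold or not src.requests:
        return None
    # Move the smallest request to minimize migrated KV-cache volume.
    rid = min(src.requests, key=src.requests.get)
    dst.requests[rid] = src.requests.pop(rid)
    return rid
```

In this toy model, periodic calls to `rebalance` defragment free KV-cache capacity onto fewer instances and smooth load imbalances, which is the rough intuition behind the prefill- and tail-latency improvements the talk covers.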