Overview
Learn about Llumnix, a dynamic request scheduling system for large language model (LLM) serving, in this technical talk from Ray Summit 2024. Alibaba Group built Llumnix on vLLM and Ray to address major challenges in LLM serving through runtime rescheduling of requests and KV cache migration across instances. The talk covers how the system reduces prefill latencies via cross-instance defragmentation and minimizes tail decoding latencies through load balancing and fewer preemptions. It follows the research from concept to publication at OSDI '24 and implementation at Alibaba, and explains in detail how Ray and vLLM are used to address complex scheduling challenges in AI infrastructure. It also covers the current capabilities of the open-source project, available on GitHub, and its future development roadmap.
Syllabus
Dynamic Scheduling for Large Language Model Serving | Ray Summit 2024
Taught by
Anyscale