Dynamic Scheduling for Large Language Model Serving

Anyscale via YouTube

Overview

This technical talk from Ray Summit 2024 introduces Llumnix, a dynamic request scheduling system for large language model (LLM) serving developed at Alibaba Group. Built on vLLM and Ray, Llumnix reschedules requests at runtime by migrating their KV caches across serving instances. The talk explains how this reduces prefill latencies through cross-instance defragmentation and lowers tail decoding latencies through load balancing and fewer preemptions. It also traces the research from concept to publication at OSDI '24 and deployment at Alibaba, describes how Ray and vLLM are used to implement the scheduling, and covers the current state of the open-source project on GitHub along with its future development roadmap.
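To make the load-balancing idea concrete, here is a minimal toy sketch in Python. It is not Llumnix's algorithm or API: the `Instance` class, the `rebalance` function, the `max_gap` threshold, and the "move the smallest request first" policy are all hypothetical and chosen only for illustration. A migration is modeled as moving a dictionary entry, whereas a real system has to copy KV cache data between running instances.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """Hypothetical model of one serving instance and its KV cache."""
    name: str
    capacity: int                                  # total KV cache blocks
    requests: dict = field(default_factory=dict)   # request id -> blocks held

    @property
    def used(self) -> int:
        return sum(self.requests.values())

    @property
    def free(self) -> int:
        return self.capacity - self.used


def rebalance(instances, max_gap):
    """Move requests from the most loaded to the least loaded instance
    until their used-block gap is at most max_gap or no move helps.
    Returns a list of (request_id, source, destination) migrations."""
    migrations = []
    while True:
        src = max(instances, key=lambda i: i.used)
        dst = min(instances, key=lambda i: i.used)
        gap = src.used - dst.used
        if gap <= max_gap:
            break
        # Keep only moves that shrink the gap and fit on the destination.
        candidates = [
            (rid, blocks) for rid, blocks in src.requests.items()
            if 0 < blocks < gap and blocks <= dst.free
        ]
        if not candidates:
            break
        # Assumed policy: the smallest request has the least data to copy.
        rid, blocks = min(candidates, key=lambda c: c[1])
        del src.requests[rid]
        dst.requests[rid] = blocks
        migrations.append((rid, src.name, dst.name))
    return migrations


# Toy scenario: instance "a" is nearly full, instance "b" is nearly empty.
a = Instance("a", 100, {"r1": 40, "r2": 30, "r3": 20})
b = Instance("b", 100, {"r4": 10})
migrations = rebalance([a, b], max_gap=20)
print(migrations, a.used, b.used)
```

In this scenario the sketch moves `r3` and then `r2` from `a` to `b`, leaving 40 and 60 blocks in use. Evening out load in this way leaves headroom on each instance, which is the intuition behind fewer preemptions during decoding.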

Syllabus

Dynamic Scheduling for Large Language Model Serving | Ray Summit 2024

Taught by

Anyscale
