Efficiently Serving Large Language Models - Optimizing Performance and Resource Management
Centre for Networked Intelligence, IISc via YouTube
Overview
Explore the challenges and solutions in efficiently serving Large Language Models (LLMs) in this technical talk by Dr. Ashish Panwar, Senior Researcher at Microsoft Research India. Gain insight into why LLM deployments require multiple GPUs per replica yet still leave those GPUs under-utilized, and discover recent research from Microsoft that addresses these efficiency challenges. Learn about solutions such as Sarathi-Serve [OSDI'24] and vAttention [ASPLOS'25], which tackle fundamental scheduling and memory-management problems in LLM serving systems. Understand the current landscape of LLM deployment across applications such as chatbots, search, and code assistants, while diving into the technical complexities of making these systems more resource-efficient and cost-effective.
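For context on the kind of scheduling problem Sarathi-Serve targets, the sketch below illustrates the general idea of chunked prefill in a hypothetical serving loop: long prompt prefills are split under a fixed per-iteration token budget and co-scheduled with ongoing decode steps, so decode requests are not stalled behind a long prefill. This is a minimal illustration under assumed names (`Request`, `build_batch`, `token_budget`), not the paper's actual implementation.

```python
from dataclasses import dataclass
from collections import deque

@dataclass
class Request:
    # Hypothetical request state: how many prompt tokens still need prefill.
    rid: int
    prompt_tokens: int
    prefilled: int = 0

    @property
    def in_decode(self) -> bool:
        return self.prefilled >= self.prompt_tokens

def build_batch(requests: deque, token_budget: int = 512):
    """Assemble one iteration's batch under a fixed token budget.

    Decode requests cost ~1 token each and are admitted first, so they
    run every iteration; the remaining budget is filled with a *chunk*
    of a pending prefill instead of the whole long prefill at once.
    """
    batch, budget = [], token_budget
    for req in requests:
        if req.in_decode and budget >= 1:
            batch.append((req.rid, "decode", 1))
            budget -= 1
    for req in requests:
        if not req.in_decode and budget > 0:
            chunk = min(budget, req.prompt_tokens - req.prefilled)
            batch.append((req.rid, "prefill", chunk))
            req.prefilled += chunk
            budget -= chunk
    return batch

# Usage: one decode-phase request plus one long 2000-token prompt; the
# prefill is chunked, so the decode request still runs every iteration.
reqs = deque([Request(0, 10, prefilled=10), Request(1, 2000)])
for step in range(3):
    print(step, build_batch(reqs, token_budget=512))
```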
Syllabus
Time: 5:00 PM - PM IST
Taught by
Centre for Networked Intelligence, IISc