Efficiently Serving Large Language Models - Optimizing Performance and Resource Management
Centre for Networked Intelligence, IISc via YouTube
Overview
Explore the challenges and solutions in efficiently serving Large Language Models (LLMs) in this technical talk by Dr. Ashish Panwar, Senior Researcher at Microsoft Research India. Gain insight into why LLM deployments require multiple GPUs per replica yet still leave those GPUs under-utilized, and discover recent research from Microsoft that addresses these efficiency challenges. Learn about solutions such as Sarathi-Serve [OSDI'24] and vAttention [ASPLOS'25], which tackle fundamental scheduling and memory-management problems in LLM serving systems. Understand the current landscape of LLM deployment across applications such as chatbots, search, and code assistants, while diving into the technical complexities of making these systems more resource-efficient and cost-effective.
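For context on the kind of scheduling problem Sarathi-Serve targets, the sketch below illustrates the general idea of chunked prefill in a hypothetical serving loop: long prompt prefills are split under a fixed per-iteration token budget and co-scheduled with ongoing decode steps, so decode requests are not stalled behind a long prefill. This is a minimal illustration under assumed names (`Request`, `build_batch`, `token_budget`), not the paper's actual implementation.

```python
from dataclasses import dataclass
from collections import deque

@dataclass
class Request:
    # Hypothetical request state: how many prompt tokens still need prefill.
    rid: int
    prompt_tokens: int
    prefilled: int = 0

    @property
    def in_decode(self) -> bool:
        return self.prefilled >= self.prompt_tokens

def build_batch(requests: deque, token_budget: int = 512):
    """Assemble one iteration's batch under a fixed token budget.

    Decode requests cost ~1 token each and are admitted first, so they
    run every iteration; the remaining budget is filled with a *chunk*
    of a pending prefill instead of the whole long prefill at once.
    """
    batch, budget = [], token_budget
    for req in requests:
        if req.in_decode and budget >= 1:
            batch.append((req.rid, "decode", 1))
            budget -= 1
    for req in requests:
        if not req.in_decode and budget > 0:
            chunk = min(budget, req.prompt_tokens - req.prefilled)
            batch.append((req.rid, "prefill", chunk))
            req.prefilled += chunk
            budget -= chunk
    return batch

# Usage: one decode-phase request plus one long 2000-token prompt; the
# prefill is chunked, so the decode request still runs every iteration.
reqs = deque([Request(0, 10, prefilled=10), Request(1, 2000)])
for step in range(3):
    print(step, build_batch(reqs, token_budget=512))
```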
Syllabus
Time: 5:00 PM - PM IST
Taught by
Centre for Networked Intelligence, IISc