Overview
Learn how to achieve high-performance, cost-effective inference for large language models in this 13-minute conference talk from Ray Summit 2024. Explore the powerful combination of Ray Serve and AWS Inferentia on Amazon EKS for deploying models such as Llama 2 and Mistral-7B. Follow along as speakers Vara Bonthu and Ratnopam Chakrabarti demonstrate building scalable inference infrastructure that works around limited GPU availability. Discover how integrating Ray Serve, the AWS Neuron SDK, and the Karpenter autoscaler on Amazon EKS creates a flexible environment for AI workloads. Master strategies for optimizing costs while maintaining high performance, and gain insights into deploying and scaling large language models in production environments.
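The talk itself does not walk through code, but a minimal sketch helps illustrate the architecture it describes: a Ray Serve deployment whose replicas request Neuron cores, so Ray places them on Inferentia nodes that Karpenter can provision in an EKS cluster. The model checkpoint, core counts, and transformers-neuronx loading path below are illustrative assumptions, not details taken from the talk.

```python
# Minimal sketch (not from the talk): a Ray Serve deployment pinned to
# AWS Neuron cores. Model ID, tp_degree, and sequence length are
# illustrative assumptions.
from ray import serve
from starlette.requests import Request
from transformers import AutoTokenizer
from transformers_neuronx.llama.model import LlamaForSampling  # AWS Neuron SDK

@serve.deployment(
    num_replicas=2,
    # Each replica requests 2 Neuron cores; Ray schedules it onto a node
    # advertising them, and Karpenter can provision Inferentia nodes on
    # EKS to satisfy the request.
    ray_actor_options={"resources": {"neuron_cores": 2}},
)
class LlamaService:
    def __init__(self):
        model_id = "NousResearch/Llama-2-7b-hf"  # hypothetical checkpoint
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        # Compile the model for Neuron, sharding it across the 2 cores
        # requested above (tensor parallelism, tp_degree=2).
        self.model = LlamaForSampling.from_pretrained(
            model_id, batch_size=1, tp_degree=2, amp="f16"
        )
        self.model.to_neuron()

    async def __call__(self, request: Request) -> str:
        prompt = (await request.json())["prompt"]
        input_ids = self.tokenizer(prompt, return_tensors="pt").input_ids
        generated = self.model.sample(input_ids, sequence_length=512)
        return self.tokenizer.decode(generated[0], skip_special_tokens=True)

app = LlamaService.bind()  # deploy with: serve run <module>:app
```

In a setup like this, the `neuron_cores` resource request is what ties the pieces together: Ray Serve uses it for replica placement, and a Karpenter node pool configured for Inferentia instance types can scale nodes up and down to match demand.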
Syllabus
Scaling LLM Inference: AWS Inferentia Meets Ray Serve on EKS | Ray Summit 2024
Taught by
Anyscale