Overview
Learn how to achieve high-performance, cost-effective inference for large language models in this 13-minute conference talk from Ray Summit 2024. Explore the powerful combination of Ray Serve and AWS Inferentia on Amazon EKS for deploying models such as Llama 2 and Mistral-7B. Follow along as speakers Vara Bonthu and Ratnopam Chakrabarti demonstrate building scalable inference infrastructure that works around limited GPU availability. Discover how integrating Ray Serve, the AWS Neuron SDK, and the Karpenter autoscaler on Amazon EKS creates a flexible environment for AI workloads. Master strategies for optimizing costs while maintaining high performance, and gain insights into deploying and scaling large language models in production environments.
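The talk itself does not walk through code, but a minimal sketch helps illustrate the architecture it describes: a Ray Serve deployment whose replicas request Neuron cores, so Ray places them on Inferentia nodes that Karpenter can provision in an EKS cluster. The model checkpoint, core counts, and transformers-neuronx loading path below are illustrative assumptions, not details taken from the talk.

```python
# Minimal sketch (not from the talk): a Ray Serve deployment pinned to
# AWS Neuron cores. Model ID, tp_degree, and sequence length are
# illustrative assumptions.
from ray import serve
from starlette.requests import Request
from transformers import AutoTokenizer
from transformers_neuronx.llama.model import LlamaForSampling  # AWS Neuron SDK

@serve.deployment(
    num_replicas=2,
    # Each replica requests 2 Neuron cores; Ray schedules it onto a node
    # advertising them, and Karpenter can provision Inferentia nodes on
    # EKS to satisfy the request.
    ray_actor_options={"resources": {"neuron_cores": 2}},
)
class LlamaService:
    def __init__(self):
        model_id = "NousResearch/Llama-2-7b-hf"  # hypothetical checkpoint
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        # Compile the model for Neuron, sharding it across the 2 cores
        # requested above (tensor parallelism, tp_degree=2).
        self.model = LlamaForSampling.from_pretrained(
            model_id, batch_size=1, tp_degree=2, amp="f16"
        )
        self.model.to_neuron()

    async def __call__(self, request: Request) -> str:
        prompt = (await request.json())["prompt"]
        input_ids = self.tokenizer(prompt, return_tensors="pt").input_ids
        generated = self.model.sample(input_ids, sequence_length=512)
        return self.tokenizer.decode(generated[0], skip_special_tokens=True)

app = LlamaService.bind()  # deploy with: serve run <module>:app
```

In a setup like this, the `neuron_cores` resource request is what ties the pieces together: Ray Serve uses it for replica placement, and a Karpenter node pool configured for Inferentia instance types can scale nodes up and down to match demand.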
Syllabus
Scaling LLM Inference: AWS Inferentia Meets Ray Serve on EKS | Ray Summit 2024
Taught by
Anyscale