Learn how to achieve high-performance, cost-effective inference for large language models in this 13-minute conference talk from Ray Summit 2024. Explore the combination of Ray Serve and AWS Inferentia on Amazon EKS for deploying models such as Llama 2 and Mistral-7B. Follow along as speakers Vara Bonthu and Ratnopam Chakrabarti demonstrate building scalable inference infrastructure that works around GPU availability constraints. Discover how integrating Ray Serve, the AWS Neuron SDK, and the Karpenter autoscaler on Amazon EKS creates a flexible environment for AI workloads. Learn strategies for optimizing cost while maintaining high performance, and gain insights into deploying and scaling large language models in production environments.
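The stack described above relies on Karpenter to provision Inferentia-backed nodes on demand for Ray Serve workers. As a rough illustration only (not taken from the talk), a Karpenter NodePool requesting AWS Inferentia2 (inf2) instances might look like the following; the resource names, labels, and limits here are assumptions:

```yaml
# Illustrative sketch: a Karpenter NodePool that asks for inf2 instances
# so Ray Serve worker pods requesting Neuron devices can be scheduled.
# The NodePool name and the referenced EC2NodeClass are placeholders.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: inferentia-pool              # hypothetical name
spec:
  template:
    spec:
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["inf2.xlarge", "inf2.8xlarge"]  # Inferentia2 instance types
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default                # hypothetical EC2NodeClass
  limits:
    aws.amazon.com/neuron: 16       # cap on total Neuron devices provisioned
```

In a setup like this, Ray Serve pods would request the `aws.amazon.com/neuron` resource (exposed by the Neuron device plugin) so that Karpenter scales Inferentia nodes only when inference replicas actually need them.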