Overview
Watch a technical conference talk from Ray Summit 2024 where Google engineers Fanhai Lu and Richard Liu present an advanced serving stack for deploying Large Language Models (LLMs) at scale. Learn how to overcome key LLM deployment challenges by combining Ray's distributed computing capabilities with TPU acceleration and Google Kubernetes Engine (GKE) orchestration. Discover architectural strategies for optimizing latency and throughput, managing hardware memory constraints, and scaling cloud compute resources in production environments. Gain practical insights from real-world deployments of models like Llama 3 and explore best practices for implementing GenAI solutions on Google Cloud Platform using XLA+TPUs for computation, Ray for multi-host deployments, and GKE for TPU pod slice orchestration.
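For orientation, here is a minimal sketch of the serving pattern the talk describes, assuming Ray Serve as the serving layer on a GKE cluster whose node pools expose TPU chips as a Ray "TPU" resource. The loader function, model path, and generate API below are hypothetical placeholders for an XLA/TPU inference engine, not code from the talk.

```python
"""A minimal sketch, assuming Ray Serve on a GKE cluster where TPU chips
are registered as a Ray "TPU" resource. `load_llama3` is a hypothetical
stand-in for a real XLA/TPU inference engine; it is not the speakers' code."""
from starlette.requests import Request
from ray import serve


def load_llama3(model_path: str):
    """Hypothetical loader: a real version would compile the model with XLA
    and shard its weights across the TPU chips of a pod slice."""
    class _StubEngine:
        def generate(self, prompt: str, max_tokens: int) -> str:
            return f"[{max_tokens}-token completion for {prompt!r}]"
    return _StubEngine()


@serve.deployment(
    num_replicas=2,  # scale replicas out across GKE TPU node pools
    ray_actor_options={"resources": {"TPU": 4}},  # reserve 4 TPU chips per replica
)
class LlamaServer:
    def __init__(self):
        # Hypothetical model path; batching and streaming are omitted for brevity.
        self.engine = load_llama3("meta-llama/Meta-Llama-3-8B")

    async def __call__(self, request: Request) -> dict:
        payload = await request.json()
        text = self.engine.generate(payload["prompt"], max_tokens=256)
        return {"completion": text}


# Ray Serve handles HTTP routing and replica placement; GKE schedules the
# underlying Ray worker pods onto TPU pod slices.
serve.run(LlamaServer.bind())
```

This division of labor mirrors the stack from the talk: the XLA-compiled model does the computation on TPU, Ray coordinates the multi-host deployment, and GKE provisions and orchestrates the TPU pod slices underneath.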
Syllabus
Scaling LLMs on Google Cloud: Synergy Between Ray, TPU, and GKE | Ray Summit 2024
Taught by
Anyscale