Overview
Watch a technical conference talk from Ray Summit 2024 where Google engineers Fanhai Lu and Richard Liu present an advanced serving stack for deploying Large Language Models (LLMs) at scale. Learn how to overcome key LLM deployment challenges by combining Ray's distributed computing capabilities with TPU acceleration and Google Kubernetes Engine (GKE) orchestration. Discover architectural strategies for optimizing latency and throughput, managing hardware memory constraints, and scaling cloud compute resources in production environments. Gain practical insights from real-world deployments of models like Llama 3 and explore best practices for implementing GenAI solutions on Google Cloud Platform using XLA+TPUs for computation, Ray for multi-host deployments, and GKE for TPU pod slice orchestration.
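For orientation, here is a minimal sketch of the serving pattern the talk describes, assuming Ray Serve as the serving layer on a GKE cluster whose node pools expose TPU chips as a Ray "TPU" resource. The loader function, model path, and generate API below are hypothetical placeholders for an XLA/TPU inference engine, not code from the talk.

```python
"""A minimal sketch, assuming Ray Serve on a GKE cluster where TPU chips
are registered as a Ray "TPU" resource. `load_llama3` is a hypothetical
stand-in for a real XLA/TPU inference engine; it is not the speakers' code."""
from starlette.requests import Request
from ray import serve


def load_llama3(model_path: str):
    """Hypothetical loader: a real version would compile the model with XLA
    and shard its weights across the TPU chips of a pod slice."""
    class _StubEngine:
        def generate(self, prompt: str, max_tokens: int) -> str:
            return f"[{max_tokens}-token completion for {prompt!r}]"
    return _StubEngine()


@serve.deployment(
    num_replicas=2,  # scale replicas out across GKE TPU node pools
    ray_actor_options={"resources": {"TPU": 4}},  # reserve 4 TPU chips per replica
)
class LlamaServer:
    def __init__(self):
        # Hypothetical model path; batching and streaming are omitted for brevity.
        self.engine = load_llama3("meta-llama/Meta-Llama-3-8B")

    async def __call__(self, request: Request) -> dict:
        payload = await request.json()
        text = self.engine.generate(payload["prompt"], max_tokens=256)
        return {"completion": text}


# Ray Serve handles HTTP routing and replica placement; GKE schedules the
# underlying Ray worker pods onto TPU pod slices.
serve.run(LlamaServer.bind())
```

This division of labor mirrors the stack from the talk: the XLA-compiled model does the computation on TPU, Ray coordinates the multi-host deployment, and GKE provisions and orchestrates the TPU pod slices underneath.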
Syllabus
Scaling LLMs on Google Cloud: Synergy Between Ray, TPU, and GKE | Ray Summit 2024
Taught by
Anyscale