Operationalizing High-Performance GPU Clusters in Kubernetes - Lessons Learned from Training Databricks DBRX

Overview

This technical conference talk covers the challenges of operating high-performance GPU clusters on Kubernetes and the solutions applied while training Databricks DBRX. Learn how to manage a 400-node cluster with 3072 GPUs, monitor GPU health with Prometheus and DCGM Exporter, and monitor GPUDirect Remote Direct Memory Access (GDRDMA). The talk also covers failure scenarios encountered during large language model (LLM) training, the engineering considerations of running GPU clusters across multiple cloud providers, and how to keep the node fleet and interconnect fabric healthy while training state-of-the-art LLMs at scale.

Syllabus

Operationalizing High-Performance GPU Clusters in Kubernetes: Lessons Learne... Will Gleich & Wai Wu

Taught by

CNCF [Cloud Native Computing Foundation]

Related Courses

Keep HPC Running - SRE's Guide to Supporting GPUs on Kubernetes

Tackling GPU Shortages and High Costs by Harnessing Hybrid Kubernetes Clusters

Mastering GPU Management in Kubernetes Using the Operator Pattern

Training Large Language Models on Kubernetes
