Operationalizing High-Performance GPU Clusters in Kubernetes - Lessons Learned from Training Databricks DBRX
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Explore a technical conference talk on the challenges of operating high-performance GPU clusters in Kubernetes, and the solutions found while training Databricks DBRX. Learn how to manage a 400-node cluster with 3072 GPUs, monitor GPU health using Prometheus and DCGM Exporter, and monitor GPUDirect Remote Direct Memory Access (GDRDMA). Discover practical approaches to failure scenarios that arise during large language model training, and understand the engineering considerations involved in working with GPU clusters across multiple cloud providers. See how to keep the node fleet and the interconnect fabric healthy while training state-of-the-art LLMs at scale.
Syllabus
Operationalizing High-Performance GPU Clusters in Kubernetes: Lessons Learne... Will Gleich & Wai Wu
Taught by
CNCF [Cloud Native Computing Foundation]