Operationalizing High-Performance GPU Clusters in Kubernetes - Lessons Learned from Training Databricks DBRX

CNCF [Cloud Native Computing Foundation] via YouTube

Overview

This technical conference talk examines the challenges of operating high-performance GPU clusters in Kubernetes, drawing on the training of Databricks DBRX. Learn how to manage a 400-node cluster with 3072 GPUs, monitor GPU health with Prometheus and DCGM Exporter, and handle GPUDirect Remote Direct Memory Access (GDRDMA) monitoring. The talk also covers failure scenarios encountered during large language model training, the engineering considerations of working with GPU clusters across multiple cloud providers, and how to keep the node fleet and interconnect fabric healthy while training state-of-the-art LLMs at scale.

Syllabus

Operationalizing High-Performance GPU Clusters in Kubernetes: Lessons Learne... Will Gleich & Wai Wu

Taught by

CNCF [Cloud Native Computing Foundation]
