CNCF [Cloud Native Computing Foundation]

Monitoring GPUs at Scale for AI/ML and HPC Clusters

CNCF [Cloud Native Computing Foundation] via YouTube

Overview

Explore a comprehensive conference talk on monitoring GPU clusters for AI/ML and HPC workloads at scale. Learn how NVIDIA addresses the monitoring needs of various user personas, including AI/ML researchers, operations teams, and stakeholders. Discover the combination of open-source tools used to meet diverse requirements and gain insights into deployment, maintenance, security, and scalability challenges encountered when monitoring GPU data. Understand how NVIDIA overcame these obstacles to create an effective monitoring solution for large GPU Kubernetes clusters running deep learning training workloads.

Syllabus

Monitoring GPUs at Scale for AI/ML and HPC Clusters - Bharti L Agrawal, NVIDIA

Taught by

CNCF [Cloud Native Computing Foundation]

