PCIe Express Error Handling and RAS Solutions for AI/ML Training Clusters

Open Compute Project via YouTube

Overview

Learn about innovative PCIe fault handling solutions in this 16-minute technical presentation from Meta engineers Anil Agrawal and Carlos Fernandez at Open Compute Project. Discover how Meta manages PCIe correctable errors at scale within their AI/ML training clusters, which utilize numerous PCIe devices including GPUs, NICs, NVMe drives, and PCIe switches. Gain valuable insights into implementing robust Reliability, Availability, and Serviceability (RAS) solutions for PCIe device hierarchies, particularly in environments where high rates of correctable errors are expected. Explore practical considerations and lessons learned from Meta's experience in maintaining target uptime and serviceability objectives in large-scale AI/ML infrastructure.

Syllabus

PCI Express correctable error handling (RAS): solution implementation considerations

Taught by

Open Compute Project
