Implementing Out-of-Band CPER Logs in Meta's AI ML System - A Case Study

Open Compute Project via YouTube

Overview

Learn how Meta implemented standardized CPER (Common Platform Error Record) logs for out-of-band error reporting in its AI/ML systems in this 24-minute technical presentation from Open Compute Project. Hardware Systems Engineers Anil Agrawal and Jinghan Yang describe the transition from proprietary methods to a standardized approach for reporting accelerator-detected errors. The talk covers the technical challenges encountered during implementation, the solutions developed to address them, and lessons for future deployments of similar error reporting systems.

Syllabus

A Case Study of Implementing Out-of-Band CPER Logs in Meta's AI/ML System

Taught by

Open Compute Project
