Learn about Meta's implementation of standardized CPER (Common Platform Error Record) logs for out-of-band error reporting in AI/ML systems in this 24-minute technical presentation from the Open Compute Project. Hardware Systems Engineers Anil Agrawal and Jinghan Yang share their experience moving from proprietary methods to a standardized approach for reporting accelerator-detected errors. Explore the technical challenges encountered during implementation, the solutions developed, and the lessons for future deployments of similar error reporting systems.
Implementing Out-of-Band CPER Logs in Meta's AI/ML System - A Case Study
Open Compute Project via YouTube
Overview
Syllabus
A Case Study of Implementing Out-of-Band CPER Logs in Meta's AI/ML System
Taught by
Open Compute Project