Overview
Learn essential debugging techniques and best practices for complex AI GPU systems in this technical presentation from Microsoft engineers. Explore key challenges in PCIe subsystem debugging, including PCIe path analysis, link training issues, and error handling across CPU-GPU connections. Discover effective approaches for troubleshooting system hangs and crashes, and gain insight into the complexities of the UBB (Universal Baseboard) management controller and its interaction with the BMC (baseboard management controller). Work through real-world examples of critical use cases, common failures, and the hardware and software tools that streamline debugging in AI GPU environments.
Syllabus
Debug ability and Debug Practices of AI GPU Systems
Taught by
Open Compute Project