Overview
Explore a 31-minute conference talk from SREcon24 Europe/Middle East/Africa on applying Site Reliability Engineering (SRE) principles to High Performance Computing (HPC) systems. Learn how Los Alamos National Laboratory (LANL) addresses the challenges of managing purpose-built HPC machines that have traditionally been operated through human-facing workflows. Discover how adopting SRE methodologies in OpenCHAMI, a new administrative stack, helps maintain critical performance metrics while combating generational churn in HPC systems. Understand how this approach ensures exact reproducibility, parallel bandwidth, and optimal compute time to solution while better serving the needs of specialized code bases and customer requirements.
Syllabus
SREcon24 Europe/Middle East/Africa - Science Reliability Engineering for High Performance Computing
Taught by
USENIX