Science Reliability Engineering for High Performance Computing

Overview

Explore a 31-minute conference talk from SREcon24 Europe/Middle East/Africa on applying Site Reliability Engineering (SRE) principles to High Performance Computing (HPC) systems. Learn how Los Alamos National Laboratory (LANL) addresses the challenges of managing purpose-built HPC machines that have traditionally been operated through human-facing workflows. Discover how adopting SRE methodologies in OpenCHAMI, a new administrative stack, helps maintain critical performance metrics while combating generational churn in HPC systems. Understand how this approach preserves exact reproducibility, parallel bandwidth, and optimal compute time to solution while better serving the specific needs of specialized code bases and customers.

Syllabus

SREcon24 Europe/Middle East/Africa - Science Reliability Engineering for High Performance Computing

Taught by

USENIX

Related Courses

Selective Reliability Engineering: There Is No Single Source of Truth

Improving Performance with High Precision Clock Synchronization in Distributed Systems

The Silent Performance Killers: BIOS and Firmware Updates

Scheduling at Scale: eBPF Schedulers with Sched_ext

Enhancing Elasticsearch Performance: Innovative Reindexing Strategies Using Dedicated Nodes and KEDA Autoscalers

Energy Consumption of Datacenters: Impact of AI and GPU Computing
