Science Reliability Engineering for High Performance Computing

USENIX via YouTube

Overview

Explore a 31-minute conference talk from SREcon24 Europe/Middle East/Africa on applying Site Reliability Engineering (SRE) principles to High Performance Computing (HPC) systems. Learn how Los Alamos National Laboratory (LANL) addresses the challenges of managing purpose-built HPC machines that have traditionally been operated through human-facing workflows. Discover how adopting SRE methodologies in OpenCHAMI, a new administrative stack, helps maintain critical performance metrics while combating generational churn in HPC systems. Understand how this approach preserves exact reproducibility, parallel bandwidth, and optimal compute time to solution while better serving the needs of specialized code bases and customers.

Syllabus

SREcon24 Europe/Middle East/Africa - Science Reliability Engineering for High Performance Computing

Taught by

USENIX
