Hyrax - Fail-in-Place Server Operation in Cloud Platforms

Overview

This 15-minute conference talk from OSDI '23 presents Hyrax, a datacenter stack that lets compute servers with failed components continue hosting VMs while masking the degraded capacity and performance. The talk argues that the traditional all-or-nothing operating model for server hardware failures is unsustainable and that fail-in-place operation aligns with emerging technology trends. A key enabler of Hyrax is a novel model of how memory interleaving changes when faulty memory modules are deactivated. Experiments on cloud production servers show that Hyrax overcomes common hardware failures without impacting peak VM performance, and large-scale simulations using production traces show that it reduces server repair requirements by 50-60% while maintaining VM scheduling efficiency.

Syllabus

OSDI '23 - Hyrax: Fail-in-Place Server Operation in Cloud Platforms

Taught by

USENIX
