Overview
Explore a fail-in-place approach to handling server hardware failures in cloud platforms in this 15-minute conference talk from OSDI '23. Learn about Hyrax, a datacenter stack that lets compute servers with failed components continue hosting VMs while masking the degraded capacity and performance. See why the traditional all-or-nothing model, in which a server with a failed component is shut down until it has been repaired, is becoming unsustainable, and how fail-in-place operation aligns with emerging technology trends. Learn about a key enabler of Hyrax: a model of how memory interleaving changes when faulty memory modules are deactivated. Examine experimental results from cloud production servers showing that Hyrax overcomes common hardware failures without impacting peak VM performance. Review large-scale simulations driven by production traces, in which Hyrax reduces server repairs by 50-60% without hurting VM scheduling.
Syllabus
OSDI '23 - Hyrax: Fail-in-Place Server Operation in Cloud Platforms
Taught by
USENIX