Overview
This 16-minute conference talk from OSDI '22 analyzes metastable failures in distributed systems: failures in which a system remains in a degraded state even after the event that triggered it has passed. The talk covers how prevalent these failures are and the impact they have across organizations ranging from small companies to hyperscalers, and it shows that they are a recurring pattern in major outages. It presents an extended model of metastable failures, including two types of triggers and amplification mechanisms, and illustrates it with real-world examples and their implications for system design and reliability. The findings come from the researchers' study of 22 metastable failures across 11 different organizations, complemented by examples reproduced in controlled environments. The talk closes with lessons for understanding and addressing the problem.
Syllabus
Intro
What are Metastable Failures?
Metastable Failures are Prevalent
Metastability in the Wild - Survey
Defining Metastability - System States
Survey Summary
Metastability Taxonomy - Trigger
Metastability Taxonomy - Sustaining effect
Four Metastability Scenarios - Load-spike trigger
Degrees of Vulnerabilities
Lessons
Conclusion
Taught by
USENIX