Overview
Explore the intricacies of incident response and coordination in remote SRE teams through this 48-minute conference talk from SREcon20 Americas. Delve into Dr. Laura Maguire's three-year research on engineering teams handling service outages, examining 62 cases across four organizations. Discover surprising findings that challenge existing domain models, including how incident management in practice differs from what Google SRE suggests and how incident command can hinder fast resolution. Learn about the subtle choreography of cognitive work in fault management, the potential drawbacks of coordination tools, and strategies for adaptive choreography. Gain insights into how tooling and intra-organizational dependencies affect coordination costs across time and organizational boundaries, increasing complexity for SREs. Understand the challenges of coordinating multiple perspectives, dealing with backup issues, and managing hidden complexities in distributed computing environments.
Syllabus
Introduction
The Secret Lives of SREs
Coordinate Multiple Diverse Perspectives
Backup Issues
Hidden Complexity
Outlier Event
Sarah
Sarah's Knowledge
Incident Response
Incident Command
Speed Bumps
Distributed Computing
Conclusion
Taught by
USENIX