Overview
This 26-minute talk from SREcon19 Europe/Middle East/Africa follows Brian Murphy's experience as a solo Site Reliability Engineer (SRE) and how he transformed his organization's engineering culture after a difficult 2015. It describes the SRE practices he put in place, including the introduction of Service Level Indicators (SLIs), the reduction of Mean Time to Acknowledge (MTTA) and Mean Time to Resolve (MTTR), and an improved release cadence. Topics include dealing with alerts, building dashboards, focusing on reliability and availability, holding value-based conversations, promoting accountability, and sharing knowledge. The talk also covers the challenges faced, the strategies used, and the lessons learned in turning around a struggling engineering organization, with practical advice for improving both personal and organizational performance through SRE principles.
Syllabus
Introduction
Loud Clock
Problems
Alerts
Is everything wrong?
I got paged
I'm not shy
The madness phase
SLIs
SLOs
Dashboards
Reliability
Availability
Value-based conversation
Accountability
Sharing knowledge
Getting developers on board
Business side of the house
Turning it around
What didn't go so well
Conclusion
Questions
Taught by
USENIX