Overview
Explore a conference talk on the practical application of the Dickerson Pyramid in Site Reliability Engineering. Learn how to implement each level of the hierarchy using real-life examples from Google, Hillary for America, and First Look Media. Discover how to define reliability for your organization and prevent future outages by working up the pyramid from its base: monitoring, incident response, postmortems, testing and releasing, capacity planning, development, and product. Gain insights from Nat Welch's decade of experience in software engineering and his role as Lead Site Reliability Engineer at First Look Media. Understand how SRE priorities differ from those of product engineers and how to apply these concepts to improve your organization's reliability practices.
Syllabus
Intro
Nat Welch
Quick Aside: Context
Monitoring!
Grow monitoring to match your business needs
Incident Response
The fight against noise
Postmortems!
Google Compute Engine Postmortems
Testing & Releasing
Two quick stories
Capacity Planning
Pogostick
Development
Communication!
Examples
User Experience
References and Further Reading
Taught by
Strange Loop Conference