Site Reliability Engineering at Google

Overview

Explore the principles and practices of Site Reliability Engineering (SRE) at Google in this 51-minute conference talk from GOTO Amsterdam 2018. Gain insights from Christof Leng, a Senior Site Reliability Engineer, as he delves into how Google manages its vast tech infrastructure and products. Learn about the SRE approach that treats operations as a software engineering problem, addressing challenges of scale, growth, and complexity. Discover key concepts such as error budgets, the 50% cap on operational work, and the importance of keeping developers in the rotation. Understand how SRE teams minimize damage during outages, prevent recurrences, and implement a post-mortem philosophy. Get an overview of SRE organizational structure, staffing strategies, and the balance between development and operations work. This talk provides valuable knowledge for those interested in modern approaches to maintaining reliable, scalable systems in large-scale tech environments.
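As a rough illustration of the error budget concept named above, here is a minimal sketch. It assumes the common definition in which the budget is the share of requests an availability target (SLO) permits to fail. The function names, the request-based accounting, and the sample figures are hypothetical and are not taken from the talk.

```python
def error_budget(slo: float, total_requests: int) -> int:
    """Return how many failed requests the SLO tolerates in the window.

    Assumption: the budget is (1 - SLO) * total requests, rounded to
    the nearest whole request.
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in the range (0, 1]")
    return round((1.0 - slo) * total_requests)


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    budget = error_budget(slo, total_requests)
    if budget == 0:
        # A 100% target leaves no budget: any failure overspends it.
        return 1.0 if failed_requests == 0 else -1.0
    return 1.0 - failed_requests / budget


def can_release(slo: float, total_requests: int, failed_requests: int) -> bool:
    """Hypothetical policy: allow new releases only while budget is left."""
    return budget_remaining(slo, total_requests, failed_requests) > 0.0


if __name__ == "__main__":
    # Illustrative numbers only: a 99.9% target over one million requests.
    print(error_budget(0.999, 1_000_000))
    print(budget_remaining(0.999, 1_000_000, 250))
    print(can_release(0.999, 1_000_000, 1_200))
```

Under this assumed policy, a team that has used up its budget would pause feature releases until reliability recovers, which is one way the budget can balance development speed against stability.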

Syllabus

Intro
Speaker Introduction
Why Reliability?
Reliability is easy to take for granted
SRE Organizational Structure
What do you spend your budget on?
The rule
Two nice features of Error Budgets
Staffing, Work, Ops Overload
SRE hires only coders
50% cap on Ops work
Keep DEV in the rotation
Speaking of Dev and Ops work...
SRE Portability
Limiting operational work
Death, taxes, and outages...
Minimize Damage
A word on practice...
Prevent recurrence
Post-mortem philosophy
O'Reilly Book
Questions on any of these?

Taught by

GOTO Conferences
