One on One SRE

USENIX via YouTube

Overview

Explore a conference talk from SREcon19 Americas that delves into the human side of Site Reliability Engineering (SRE) at GitHub. Learn how Amy Tobey implemented 1:1 SRE outreach and incident debrief programs to foster a culture of resilience, emphasizing empathy and psychological safety. Discover the program's structure, launch process, and real-world outcomes. Gain insights into handling trauma, insufficient guard rails, and unknown unknowns in SRE. Understand the importance of informed consent and effective incident debriefing techniques. Examine key questions for post-incident analysis, including role assessment, support evaluation, preventability, and lessons learned. Consider how individual contributors can impact reliability at an organizational scale through deviant behavior and spanning tree concepts.

Syllabus

Intro
trauma: extreme stress that overwhelms a person's ability to cope
insufficient guard rails
unknown unknowns
The 1:1 Incident Debrief
informed consent
what was your role in the incident?
how long did you work on the incident?
were you able to get the support you needed?
do you feel that the incident was preventable?
what actions do you feel good about?
what do you think could have been better?
what did you learn from this incident?
what do you think we can do to prevent reoccurrence?
did our tools and documentation serve you well?
did you practice self-care during this process?
can you think of anyone else we should talk to?
spanning tree
deviant behavior
How can I, an individual contributor, impact reliability at organizational scale?

Taught by

USENIX
