Are We Getting Better Yet? - Progress Toward Safer Operations

USENIX via YouTube

Overview

Explore strategies for improving operational safety and incident management in this SREcon20 Americas talk. Delve into the complexities of measuring progress beyond shallow metrics, understanding the subtle influences of incidents on organizations, and leveraging thorough incident analysis for deeper insights. Learn how to uncover unseen opportunities through meta-analyses across incidents, provide leaders with richer data for strategic decision-making, and foster trust between leadership and practitioners. Discover the importance of prioritizing a "learn and adapt" safety mode, conducting chaos experiments as practice, and considering performance variability. Examine the balance between control and influence, opportunity and obligation, and the significance of asking deeper questions during incident debriefings. Gain valuable insights into creating healthier, happier teams and advancing toward safer operations in complex systems.

Syllabus

Complexity
Law of Stretched Systems
Prioritize a learn and adapt safety mode over a prevent and fix safety mode
Prevent & Fix
Learn & Adapt
Measuring progress
Metrics anchor the story and the story gives meaning to the metrics
Barriers and guardrails are used to prevent people from repeating mistakes
Performance variability
Ensure positive outcomes through activities like team practice and chaos experiments
Chaos experiments as scrimmage
Incidents are a source of insights
Service Level Objectives
Control vs Influence
Watch the inputs, influence the outputs
Opportunity vs Obligation
Judging human performance with metrics applies conclusions without context
Recording performance metrics promotes one perspective over others
Interview Debriefing
Ask deeper questions
How close to the safety boundary is the pod autoscaler pushing my infrastructure?
Are my cloud provider's staff team players in my sociotechnical system?
Recap

Taught by

USENIX
