Overview
Syllabus
Complexity
Law of Stretched Systems
Prioritize a learn-and-adapt safety mode over a prevent-and-fix safety mode
Prevent & Fix
Learn & Adapt
Measuring progress
Metrics anchor the story and the story gives meaning to the metrics
Barriers and guardrails are used to prevent people from repeating mistakes
Performance variability
Ensure positive outcomes through activities like team practice and chaos experiments
Chaos experiments as scrimmage
Incidents are a source of insights
Service Level Objectives
Control vs Influence
Watch the inputs, influence the outputs
Opportunity vs Obligation
Judging human performance with metrics applies conclusions without context
Recording performance metrics promotes one perspective over others
Interview Debriefing
Ask deeper questions
How close to the safety boundary is the pod autoscaler pushing my infrastructure?
Are my cloud provider's staff team players in my sociotechnical system?
Recap
Taught by
USENIX