Overview
Syllabus
Antics, drift and chaos
Add a new test
Result: execution of unit test led to an outage
Moral: use unit tests sparingly, for they are dangerous
Complex systems exhibit unexpected behavior
System failure
Generalized Uncertainty Principle
Error handling
Latency increases
More clients retry
Support systems
Non-critical service failed
Lock shared by app threads
Lock contention
Memory leak bug in agent that monitors health of EBS servers
Mitigation
Command input entered incorrectly
Lorin's conjecture
Recap: Antics
Act II: Drift
Broken parts and sloppy devs
Drift into failure
Unruly technology
Software is hard to reason about
Scarcity and competition
Efficiency vs thoroughness
Decrementalism
Sensitive dependence on initial conditions
One day...
Traffic spike
Recap: Drift
Make the wrong thing harder
Chaos engineering
Find vulnerabilities before they become outages
External validity
Risk: vulnerable to failure of non-critical services
Build a hypothesis around steady state behavior
Vary real-world events
Fail RPC calls
Add latency to RPC calls
Run experiments in production
Route prod traffic to ChAP clusters
Automate experiments to run continuously
Integrate with deployment pipelines
Minimize blast radius
Route a small fraction of traffic
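A minimal sketch of the experiment steps listed above (fail RPC calls, add latency to RPC calls, route a small fraction of traffic), assuming a plain Python callable stands in for an RPC client. The names with_chaos, InjectedFailure, fraction, and added_latency_s are hypothetical and used only for illustration; they are not taken from the talk or from ChAP.

```python
import random
import time


class InjectedFailure(Exception):
    """Raised in place of a real RPC error during an experiment (hypothetical)."""


def with_chaos(rpc, fraction, fail=False, added_latency_s=0.0,
               rng=random.random, sleep=time.sleep):
    """Wrap an RPC callable so that a small fraction of calls is disturbed.

    fraction         share of calls included in the experiment (the blast radius)
    fail             if True, affected calls raise InjectedFailure
    added_latency_s  extra delay, in seconds, applied to affected calls
    rng, sleep       injectable so the behavior can be tested deterministically
    """
    def wrapped(*args, **kwargs):
        if rng() < fraction:
            # This call falls inside the experiment group.
            if added_latency_s > 0:
                sleep(added_latency_s)
            if fail:
                raise InjectedFailure("injected RPC failure")
        # Calls outside the experiment, and delayed-only calls, reach the real RPC.
        return rpc(*args, **kwargs)
    return wrapped


if __name__ == "__main__":
    def get_recommendations(user_id):
        # Stand-in for a call to a non-critical service.
        return ["item-%d" % user_id]

    flaky = with_chaos(get_recommendations, fraction=0.01, fail=True)
    failures = 0
    for user_id in range(10_000):
        try:
            flaky(user_id)
        except InjectedFailure:
            failures += 1
    print("injected failures:", failures)
```

Keeping fraction small limits how many callers see the injected fault, which is one way to read "minimize blast radius". The hypothesis under test would be that steady-state behavior holds even when the wrapped non-critical call fails or slows down.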
Takeaways
1. Systems behave pathologically
Chaos experiments can find pathologies
2. Reasonable human decisions can lead to dangerous states
Chaos provides incentives
Taught by
Strange Loop Conference