Antics, Drift, and Chaos
- 1 Antics, drift and chaos
- 2 Add a new test
- 3 Result: execution of unit test led to an outage
- 4 Moral: use unit tests sparingly, for they are dangerous
- 5 Complex systems exhibit unexpected behavior
- 6 System failure
- 7 Generalized Uncertainty Principle
- 8 Error handling
- 9 Latency increases
- 10 More clients retry
- 11 Support systems
- 12 Non-critical service failed
- 13 Lock shared by app threads
- 14 Lock contention
- 15 Memory leak bug in agent that monitors health of EBS servers
- 16 Mitigation
- 17 Command input entered incorrectly
- 18 Lorin's conjecture
- 19 Recap: Antics
- 20 Act II: Drift
- 21 Broken parts and sloppy devs
- 22 Drift into failure
- 23 Unruly technology
- 24 Software is hard to reason about
- 25 Scarcity and competition
- 26 Efficiency vs thoroughness
- 27 Decrementalism
- 28 Sensitive dependence on initial conditions
- 29 One day...
- 30 Traffic spike
- 31 Recap: Drift
- 32 Make the wrong thing harder
- 33 Chaos engineering
- 34 Find vulnerabilities before they become outages
- 35 External validity
- 36 Risk: vulnerable to failure of non-critical services
- 37 Build a hypothesis around steady state behavior
- 38 Vary real-world events
- 39 Fail RPC calls
- 40 Add latency to RPC calls
- 41 Run experiments in production
- 42 Route prod traffic to ChAP clusters
- 43 Automate experiments to run continuously
- 44 Integrate with deployment pipelines
- 45 Minimize blast radius
- 46 Route a small fraction of traffic
- 47 Takeaways
- 48 1. Systems behave pathologically
- 49 Chaos experiments can find pathologies
- 50 2. Reasonable human decisions can lead to dangerous states
- 51 Chaos provides incentives
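
To make chapters 36 to 46 above more concrete, here is a minimal Python sketch of the kind of experiment they describe: inject a failure into calls to a non-critical service for a small fraction of traffic, then compare a steady-state metric between the control and experiment groups. Every name here (`recommendations`, `handle_request`, `run_experiment`, `hypothesis_holds`) is a hypothetical stand-in, the traffic is simulated in-process, and none of it represents ChAP or any real chaos tooling.

```python
import random
from dataclasses import dataclass


@dataclass
class Result:
    """Request counts for one group (control or experiment)."""
    requests: int = 0
    successes: int = 0

    @property
    def success_rate(self) -> float:
        # Steady-state metric: fraction of requests that were served.
        return self.successes / self.requests if self.requests else 1.0


def recommendations(fail: bool) -> list[str]:
    """Hypothetical non-critical dependency; raises when failure is injected."""
    if fail:
        raise ConnectionError("injected RPC failure")
    return ["a", "b", "c"]


def handle_request(inject_failure: bool, use_fallback: bool) -> bool:
    """Serve one simulated request; True means the user still got a response."""
    try:
        recommendations(fail=inject_failure)
    except ConnectionError:
        if not use_fallback:
            return False  # the non-critical failure took the request down
    return True


def run_experiment(total: int, fraction: float, use_fallback: bool, seed: int = 0):
    """Route roughly `fraction` of requests to the experiment group."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    control, experiment = Result(), Result()
    for _ in range(total):
        in_experiment = rng.random() < fraction  # small blast radius
        group = experiment if in_experiment else control
        group.requests += 1
        group.successes += handle_request(in_experiment, use_fallback)
    return control, experiment


def hypothesis_holds(control: Result, experiment: Result, tolerance: float = 0.01) -> bool:
    """Hypothesis: injected failure does not move the steady-state metric."""
    return abs(control.success_rate - experiment.success_rate) <= tolerance


if __name__ == "__main__":
    for fallback in (True, False):
        control, experiment = run_experiment(10_000, 0.01, use_fallback=fallback)
        print(f"fallback={fallback}: hypothesis holds = "
              f"{hypothesis_holds(control, experiment)}")
```

In this toy setup the hypothesis holds only when the request handler has a fallback for the non-critical call; without one, the experiment group's success rate drops while the control group is unaffected, which is the vulnerability the experiment is meant to surface before it becomes an outage.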