YouTube videos curated by Class Central.
Classroom Contents
Antics, Drift, and Chaos
- 1 Antics, drift and chaos
- 2 Add a new test
- 3 Result: execution of unit test led to an outage
- 4 Moral: use unit tests sparingly, for they are dangerous
- 5 Complex systems exhibit unexpected behavior
- 6 System failure
- 7 Generalized Uncertainty Principle
- 8 Error handling
- 9 Latency increases
- 10 More clients retry
- 11 Support systems
- 12 Non-critical service failed
- 13 Lock shared by app threads
- 14 Lock contention
- 15 Memory leak bug in agent that monitors health of EBS servers
- 16 Mitigation
- 17 Command input entered incorrectly
- 18 Lorin's conjecture
- 19 Recap: Antics
- 20 Act II: Drift
- 21 Broken parts and sloppy devs
- 22 Drift into failure
- 23 Unruly technology
- 24 Software is hard to reason about
- 25 Scarcity and competition
- 26 Efficiency vs thoroughness
- 27 Decrementalism
- 28 Sensitive dependence on initial conditions
- 29 One day...
- 30 Traffic spike
- 31 Recap: Drift
- 32 Make the wrong thing harder
- 33 Chaos engineering
- 34 Find vulnerabilities before they become outages
- 35 External validity
- 36 Risk: vulnerable to failure of non-critical services
- 37 Build a hypothesis around steady state behavior
- 38 Vary real-world events
- 39 Fail RPC calls
- 40 Add latency to RPC calls
- 41 Run experiments in production
- 42 Route prod traffic to ChAP clusters
- 43 Automate experiments to run continuously
- 44 Integrate with deployment pipelines
- 45 Minimize blast radius
- 46 Route a small fraction of traffic
- 47 Takeaways
- 48 1. Systems behave pathologically
- 49 Chaos experiments can find pathologies
- 50 2. Reasonable human decisions can lead to dangerous states
- 51 Chaos provides incentives