Antics, Drift, and Chaos

Overview

This 30-minute Strange Loop conference talk introduces Chaos Engineering for large-scale cloud systems. Successful systems grow organically and fail in complex ways; Chaos Engineering is a way to detect their weaknesses before they cause real harm. The talk draws on systems theory and what it predicts about system behavior and failure, in particular safety researcher Sidney Dekker's "drift into failure" theory and pediatrician John Gall's "Generalized Uncertainty Principle." It then covers how to design and run Chaos Engineering experiments, with examples from Netflix's experience, and argues that because failures are inevitable, vulnerabilities should be found proactively and experiments integrated into deployment pipelines to reduce risk and improve resilience.

Syllabus

Antics, drift and chaos
Add a new test
Result: execution of unit test led to an outage
Moral: use unit tests sparingly, for they are dangerous
Complex systems exhibit unexpected behavior
System failure
Generalized Uncertainty Principle
Error handling
Latency increases
More clients retry
Support systems
Non-critical service failed
Lock shared by app threads
Lock contention
Memory leak bug in agent that monitors health of EBS servers
Mitigation
Command input entered incorrectly
Lorin's conjecture
Recap: Antics
Act II: Drift
Broken parts and sloppy devs
Drift into failure
Unruly technology
Software is hard to reason about
Scarcity and competition
Efficiency vs thoroughness
Decrementalism
Sensitive dependence on initial conditions
One day...
Traffic spike
Recap: Drift
Make the wrong thing harder
Chaos engineering
Find vulnerabilities before they become outages
External validity
Risk: vulnerable to failure of non-critical services
Build a hypothesis around steady state behavior
Vary real-world events
Fail RPC calls
Add latency to RPC calls
Run experiments in production
Route prod traffic to ChAP clusters
Automate experiments to run continuously
Integrate with deployment pipelines
Minimize blast radius
Route a small fraction of traffic
Takeaways
1. Systems behave pathologically
Chaos experiments can find pathologies
2. Reasonable human decisions can lead to dangerous states
Chaos provides incentives
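
The experiment steps listed in the syllabus (build a steady-state hypothesis, fail RPC calls, route a small fraction of traffic, minimize blast radius) can be illustrated with a toy simulation. This is a minimal sketch and not the talk's or Netflix's implementation: the `recommendations` service, the `has_fallback` flag, the 1% traffic fraction, and the tolerance are all hypothetical names and values chosen for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class GroupResult:
    """Request counts for one group (control or experiment)."""
    requests: int = 0
    successes: int = 0

    @property
    def success_rate(self) -> float:
        # An empty group is treated as healthy so it cannot fail the hypothesis.
        return self.successes / self.requests if self.requests else 1.0


def recommendations(fail: bool) -> list:
    """Hypothetical non-critical dependency reached over RPC."""
    if fail:
        raise ConnectionError("injected RPC failure")
    return ["title-a", "title-b"]


def handle_request(inject_failure: bool, has_fallback: bool) -> bool:
    """Return True if the user-facing request succeeds."""
    try:
        recommendations(fail=inject_failure)
    except ConnectionError:
        if not has_fallback:
            return False  # the non-critical failure took down the request
    return True


def run_experiment(total_requests, fraction, has_fallback, tolerance=0.01, seed=0):
    """Route a small fraction of traffic to control and experiment groups.

    Steady-state hypothesis (assumed for this sketch): the experiment group's
    success rate stays within `tolerance` of the control group's.
    """
    rng = random.Random(seed)
    control, experiment = GroupResult(), GroupResult()
    for _ in range(total_requests):
        r = rng.random()
        if r < fraction:
            group, inject = experiment, True   # fault injected
        elif r < 2 * fraction:
            group, inject = control, False     # same size, no fault
        else:
            continue  # outside the blast radius, untouched by the experiment
        group.requests += 1
        group.successes += handle_request(inject, has_fallback)
    holds = control.success_rate - experiment.success_rate <= tolerance
    return control, experiment, holds


if __name__ == "__main__":
    for fallback in (True, False):
        c, e, holds = run_experiment(10_000, 0.01, has_fallback=fallback)
        print(f"fallback={fallback}: control={c.success_rate:.2f} "
              f"experiment={e.success_rate:.2f} hypothesis_holds={holds}")
```

Comparing against a control group of the same size, rather than against all remaining traffic, keeps the two samples comparable and limits the experiment to a small share of requests.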

Taught by

Strange Loop Conference
