Overview
This 47-minute talk from Code BEAM Europe 2022 recounts how a seemingly minor Kafka outage led to a catastrophic cluster failure in Kred, Klarna's fault-tolerant Erlang system. The speaker walks through identifying, fixing, and preventing the elusive "cluster-killer bug", a process that takes several unexpected turns and goes deep into the Erlang technology stack. The talk covers system architecture, troubleshooting techniques, metric analysis, and lock-up testing on the way to explaining why the system was vulnerable. Viewers gain insight into Erlang's memory model and pick up new tools for debugging low-level issues in Erlang applications. It is suited to developers who want to improve their debugging skills and deepen their understanding of fault tolerance in complex Erlang systems.
Syllabus
00:00 - Intro and Fault Tolerance
04:40 - System Architecture
08:28 - Troubleshooting
09:16 - Identify
13:30 - Fix
15:27 - Alert + Identify + Fix
20:37 - The incident
21:33 - Symptoms
27:28 - Validate
29:46 - The Path of Metrics
34:11 - Testing lock-ups
40:20 - The Mystery Term
Taught by
Code Sync