Overview
Explore a complex incident in a Kubernetes environment through this SREcon23 Americas conference talk. Follow Hemanth Malla and Elijah Andrews from Datadog as they unravel a mysterious issue initially thought to be DNS-related. Dive deep into the debugging process, from examining logs to reading kernel code, and discover how the team uncovered the connection between gRPC client reconnect algorithms and an overflowing conntrack table. Learn about the unexpected twists and turns in their investigation, including the analysis of martian packet dropping and other corner cases. Gain insights into troubleshooting complex systems and see how the resolution ultimately came down to removing just three lines of code. This 38-minute presentation offers valuable lessons for SREs and system administrators dealing with intricate infrastructure problems.
Syllabus
SREcon23 Americas - Logs Told Us It Was DNS, It Looked like DNS, It Had to Be DNS, It Wasn't DNS
Taught by
USENIX