Explore a complex incident in a Kubernetes environment through this SREcon23 Americas conference talk. Follow Hemanth Malla and Elijah Andrews from Datadog as they unravel a mysterious issue initially thought to be DNS-related. Dive deep into the debugging process, from examining logs to reading kernel code, and discover how the team uncovered the connection between gRPC client reconnect algorithms and an overflowing conntrack table. Learn about the unexpected twists and turns in their investigation, including the analysis of dropped martian packets and other corner cases. Gain insights into troubleshooting complex systems and see how the resolution ultimately came down to removing just three lines of code. This 38-minute presentation offers valuable lessons for SREs and system administrators dealing with intricate infrastructure problems.