Overview
This SREcon22 Americas conference talk explores the inner workings of Datadog's metrics backend. It traces the evolution of Datadog's distributed system from its small beginnings to its current large-scale operation across major cloud providers, covering the scaling and reliability challenges the team faced, the solutions they adopted, and the key lessons and strategies that emerged. Topics include Kafka's role at Datadog, partitioning techniques, and how the system handles node, availability zone, Kubernetes cluster, and cloud vendor failures. The talk also discusses balanced topics, consumer shards, and partition imbalance, and closes with a look at unsolved problems and future plans for the metrics backend. Presented by Adam Mckaig, Staff Engineer, and Tahia Khan, SRE at Datadog, it is aimed at those interested in large-scale distributed systems and cloud monitoring.
Syllabus
Intro
What is a "Data Dog"?
Overview
Kafka 101
Kafka at Datadog
Baby's First Keyfunc
Node Failure
AZ Failure
Kube Cluster Failure
Cloud Vendor Failure
Remember This
Partitioning (not the Kafka kind)
Partitioning: Before
Partitioning: After
Balanced Topic
Consumer Shards
Big Customers
Partition Imbalance
Slicer
Rebalancing
Taught by
USENIX