Identifying Hidden Dependencies

Overview

In this SREcon20 Americas talk, learn how Honeycomb improved the reliability of its ZooKeeper, Kafka, and stateful storage systems by terminating nodes intentionally. The talk traces the journey from manual experiments to automated node recycling, including the bugs in replacement tools uncovered along the way. It covers resilience engineering, continuous delivery, and continuity of operations, and shows how to quantify reliability, identify potential risks, and design experiments to probe those risks.

Service Level Objectives (SLOs) are presented as a common language for defining success and managing error budgets. The talk also addresses data persistence, monitoring for changes using Service Level Indicators (SLIs), and using observability for debugging. Honeycomb's progression towards continuously running experiments resulted in no node living longer than 12 months and in weekly automated node recycling. The practices for improving reliability and scalability apply even without advanced automation or a Kubernetes deployment.
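
To make the error-budget idea concrete, here is a minimal sketch of the arithmetic, assuming an event-based SLO (a target percentage of good events over a window). The names `SLO`, `error_budget`, `budget_remaining`, and `safe_to_experiment`, and the 50% threshold, are illustrative assumptions and are not taken from the talk or from any Honeycomb tooling.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    """A target fraction of good events over a rolling window (hypothetical type)."""
    target: float     # e.g. 0.999 means 99.9% of events must be good
    window_days: int  # length of the rolling window; informational in this sketch


def error_budget(slo: SLO, total_events: int) -> float:
    """Number of events allowed to fail in the window without breaching the SLO."""
    return (1.0 - slo.target) * total_events


def budget_remaining(slo: SLO, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget(slo, total_events)
    if budget == 0:
        return 0.0  # a 100% target leaves no budget to spend
    return 1.0 - bad_events / budget


def safe_to_experiment(slo: SLO, total_events: int, bad_events: int,
                       min_remaining: float = 0.5) -> bool:
    """Assumed policy: only run a risky experiment if enough budget is left."""
    return budget_remaining(slo, total_events, bad_events) >= min_remaining
```

With a 99.9% target and one million events in the window, roughly 1,000 events may fail. Under the assumed 50% threshold, an experiment would go ahead after 250 failures but not after 900.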

Syllabus

Intro
Big data is operationally complex.
Observability is evolving quickly.
Two dozen engineers build Honeycomb.
We make systems humane to run
by ingesting telemetry,
enabling data exploration,
and empowering engineers.
We deploy with confidence.
Continuous delivery is an investment.
Continuity of operations even more so.
Stable platforms empower innovation.
But stateful services can be scary.
We need velocity and reliability.
Quantify reliability.
Identify potential areas of risk.
Design experiments to probe risk.
Prioritize addressing risks.
How broken is "too broken"?
Service Level Objectives define success.
SLOs are common language.
Think in terms of events in context.
HTTP Code 200? Latency 100ms?
Set a target Service Level Objective.
Use a window and target percentage.
We keep SLOs at Honeycomb.
We store incoming telemetry.
Alerts usually evaluate every minute.
Often, queries come back under 10s.
Error budget: allowed unavailability.
Is it safe to do this risky experiment?
Data persistence is tricky.
Experiment using error budgets.
Infrequent changes.
Long-running processes.
Data integrity and consistency.
Delicate failover dances.
Restart one server & service at a time.
Bugs are shallow with more eyes.
Monitor for changes using SLIs.
Debug with observability.
Test the telemetry too!
Verify fixes by repeating.
Continuously verify to stop regression.
Save money with flexibility.
Hypothesize, test, and learn.
Celebrate successes and failures.
Be more reliable & scalable.
Sleep easily at night.
You can do this too, step by step.
Read our blog! hny.co/blog
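
The syllabus items "Restart one server & service at a time" and "Monitor for changes using SLIs" can be sketched as a sequential loop that halts as soon as a health check fails. This is only an illustration under stated assumptions: `recycle_one_at_a_time`, `replace_node`, and `sli_healthy` are hypothetical names, and the talk's actual tooling is not shown here.

```python
from typing import Callable, Iterable, List


def recycle_one_at_a_time(
    nodes: Iterable[str],
    replace_node: Callable[[str], None],
    sli_healthy: Callable[[], bool],
) -> List[str]:
    """Replace nodes sequentially, halting as soon as the SLI check fails.

    Returns the nodes that were replaced and then verified healthy.
    Both callbacks are supplied by the caller (hypothetical hooks).
    """
    recycled: List[str] = []
    for node in nodes:
        if not sli_healthy():
            break  # never start a replacement while the service is unhealthy
        replace_node(node)
        if not sli_healthy():
            break  # the replacement hurt the SLI: stop and investigate
        recycled.append(node)
    return recycled
```

Because only one node is in flight at any moment, a failed check points at a single replacement, which keeps the blast radius small and the debugging target clear.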

Taught by

USENIX
