Syllabus
Intro
Big data is operationally complex.
Observability is evolving quickly.
Two dozen engineers build Honeycomb.
We make systems humane to run
by ingesting telemetry,
enabling data exploration,
and empowering engineers.
We deploy with confidence.
Continuous delivery is an investment.
Continuity of operations even more so.
Stable platforms empower innovation.
But stateful services can be scary.
We need velocity and reliability.
Quantify reliability.
Identify potential areas of risk.
Design experiments to probe risk.
Prioritize addressing risks.
How broken is "too broken"?
Service Level Objectives define success.
SLOs are common language.
Think in terms of events in context.
HTTP Code 200? Latency 100ms?
Set a target Service Level Objective.
Use a window and target percentage.
We keep SLOs at Honeycomb.
We store incoming telemetry.
Alerts usually evaluate every minute.
Often, queries come back under 10s.
Error budget: allowed unavailability
Is it safe to do this risky experiment?
Data persistence is tricky.
Experiment using error budgets.
Infrequent changes.
Long-running processes.
Data integrity and consistency.
Delicate failover dances
Restart one server & service at a time.
Bugs are shallow with more eyes.
Monitor for changes using SLIs.
Debug with observability.
Test the telemetry too!
Verify fixes by repeating.
Continuously verify to stop regression.
Save money with flexibility.
Hypothesize, test, and learn.
Celebrate successes and failures.
Be more reliable & scalable.
Sleep easily at night.
You can do this too, step by step.
Read our blog! hny.co/blog
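The SLO and error budget items in the syllabus can be illustrated with a small calculation. The following is a minimal sketch, not Honeycomb's implementation: the Event type, the function names, and the 99.9% target are assumptions made for illustration. Only the "HTTP 200" and "100ms" criteria come from the syllabus itself. Each event is classified as good or bad, the good fraction over a window is compared against the target, and the unspent error budget decides whether a risky experiment is affordable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One request observed during the SLO window (hypothetical shape)."""
    status_code: int
    latency_ms: float


def is_good(event, max_latency_ms=100.0):
    # SLI: an event counts as good if it returned HTTP 200 within the latency limit.
    return event.status_code == 200 and event.latency_ms <= max_latency_ms


def error_budget_report(events, target=0.999):
    """Summarize SLO compliance for all events in one window.

    target is the assumed SLO target as a fraction (0.999 means 99.9%).
    """
    total = len(events)
    bad = sum(1 for e in events if not is_good(e))
    allowed_bad = (1 - target) * total  # error budget, measured in events
    return {
        "total": total,
        "bad": bad,
        "compliance": (total - bad) / total if total else 1.0,
        "allowed_bad": allowed_bad,
        "remaining": allowed_bad - bad,
    }


def safe_to_experiment(report, expected_bad_events):
    # An experiment fits if its expected cost does not exceed the remaining budget.
    return report["remaining"] >= expected_bad_events


# Example window: 10,000 events, of which 2 failed and 2 were too slow.
window = (
    [Event(200, 20.0)] * 9996
    + [Event(500, 15.0)] * 2
    + [Event(200, 250.0)] * 2
)
report = error_budget_report(window, target=0.999)
print(report["bad"], round(report["compliance"], 4), round(report["remaining"], 2))
print(safe_to_experiment(report, expected_bad_events=5))
```

With a 99.9% target, 10,000 events allow about 10 bad ones. Four are already spent, so an experiment expected to cost 5 bad events fits in the budget, while one expected to cost 7 does not.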
Taught by
USENIX