Overview
In this NDC Porto 2022 conference talk, Ricardo Castro examines reliability in complex technology ecosystems. The talk covers why it matters to understand system changes and their impact on service provision, and what users expect from a system: uptime, responsiveness, speed, consistency, and reliability. It shows how system reliability correlates directly with user satisfaction and business success, defines reliability from the user's perspective, and explains why perfection isn't always necessary. It then turns to practical approaches for addressing reliability challenges with limited resources, built on Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs): how to create effective SLOs, document them, and set acceptable targets, and how visualization, alerts, and error budget policies help maintain system reliability. The talk closes with the extended reliability stack and why these concepts are crucial for modern technology ecosystems.
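To make the idea of an acceptable target and an error budget concrete, here is a minimal sketch of the usual arithmetic. It is an illustration only and is not taken from the talk: the function names are hypothetical, and the 30-day window is an assumed default.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of unavailability allowed in the window for an SLO target.

    slo_target is a fraction, e.g. 0.999 for "99.9%".
    The 30-day window is an assumption made for this example.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent for an event-based SLI.

    Returns 1.0 when nothing has been spent, 0.0 when the budget is exactly
    used up, and a negative number when the SLO has been breached.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events == 0:
        return 1.0  # no traffic yet, so no budget has been consumed
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    # A 99.9% target over 30 days leaves roughly 43.2 minutes of budget.
    print(round(error_budget_minutes(0.999), 1))
    # 500 failed requests out of 1,000,000 spends about half of that budget.
    print(round(budget_remaining(0.999, 999_500, 1_000_000), 2))
```

The point of the sketch is that each extra "nine" in the target shrinks the budget tenfold, which is one reason a perfect target is rarely the right one.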
Syllabus
Intro
An example from the real world
What is Reliability?
Service Level Indicator (SLI)
Service Level Objective (SLO)
How to create good SLOs
SLO Document
What is an acceptable target?
Back-of-the-envelope cost calculations
Service Level Agreement (SLA)
Visualization
Alerts
Error Budget Policy
Reliability Stack Extended
Why is this important?
Shameless plug
Taught by
NDC Conferences