How to SRE When Everything's Already on Fire

Overview

Discover how to implement Site Reliability Engineering (SRE) practices in a challenging environment through this 40-minute conference talk from SREcon19 Europe/Middle East/Africa. Follow Squarespace engineers Alex Hidalgo and Alex Lee as they share their journey of transforming a struggling centralized logging platform from 85% reliability to a documented 99.9% uptime. Learn about key SRE concepts such as Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets. Explore the implementation of the Incident Command System (ICS) and its role in addressing operational challenges. Gain insights into data collection strategies, lessons learned, and the importance of incremental progress in improving system reliability. Understand how to prioritize user-focused alerting and apply SRE principles to resolve long-standing incidents, even when starting from a critical state.
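To give a sense of scale for the reliability figures mentioned above, here is a minimal sketch of error-budget arithmetic. It assumes a time-based availability SLO measured over a 30-day window. That is an illustrative assumption, not necessarily how the speakers measured reliability, and the function name `error_budget_minutes` is hypothetical.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of unreliability an availability SLO permits over the window.

    Assumes a simple time-based SLI (good minutes / total minutes).
    """
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


if __name__ == "__main__":
    # Compare the two reliability levels mentioned in the overview.
    for target in (85.0, 99.9):
        budget = error_budget_minutes(target)
        print(f"{target}% over 30 days allows {budget:.1f} minutes of unreliability")
```

Under these assumptions, a 99.9% target leaves roughly 43 minutes of budget per 30 days, while 85% corresponds to about 108 hours.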

Syllabus

Intro
A PHENOMENAL EVENING
ELK @ SQUARESPACE
SERVICE RELIABILITY PRINCIPLES
THE RELIABILITY STACK
SERVICE LEVEL INDICATORS
SERVICE LEVEL OBJECTIVES
ERROR BUDGETS ARE AWESOME
THIS RELIABILITY STUFF ISN'T NEW
THE INCIDENT COMMAND SYSTEM
PROBLEMS THE ICS ADDRESSES
OPERATIONS LEAD
INCIDENT COMMANDER 1
TIMELINE OF A 37-HOUR INCIDENT
SEE THE FOREST FOR THE TREES
THE UNSHARDENING
KEY COMPONENTS
DATA COLLECTION
LESSONS LEARNED
REPAIR ITEMS
PROGRESS IS INCREMENTAL
ALERT ON WHAT MATTERS: Put your users first

Taught by

USENIX
