

Evolution of Incident Management at Slack

USENIX via YouTube

Overview

Explore the evolution of incident management at Slack in this 28-minute conference talk from SREcon21. Discover how the company handles dozens of incidents weekly while delivering over 150 million messages per minute at peak. Learn about Slack's journey to make incident management a core capability for their entire engineering team, including their history, reliability crisis, and vision for incident management. Gain insights into their incident management plan, training, severity levels, and the roles of Major Incident Commanders. Understand how Slack manages simultaneous incidents, implements Area Command, and handles long-duration and pillar incidents. Examine ongoing challenges, recruitment and training strategies, and the impact of success on incident management practices.

Syllabus

Intro
History of Slack
Reliability Crisis
Incident Management Vision
Incident Management Plan
Incident Management Training
Severity Levels
Major IC
Major IC on-call
Major IC responsibility
Simultaneous incidents
Area Command
Long Duration Incidents
Pillar Incidents
What's Next
Ongoing Challenges
Recruitment and Training
Challenge of Success

Taught by

USENIX

