Overview
Explore the evolution of incident management at Slack in this 28-minute conference talk from SREcon21. Discover how the company handles dozens of incidents weekly while delivering over 150 million messages per minute at peak. Learn about Slack's journey to make incident management a core capability for its entire engineering team, including the company's history, its reliability crisis, and its vision for incident management. Gain insights into Slack's incident management plan, training, severity levels, and the role of Major Incident Commanders (Major ICs). Understand how Slack manages simultaneous incidents, implements Area Command, and handles long-duration and pillar incidents. Examine ongoing challenges, recruitment and training strategies, and the impact of success on incident management practices.
Syllabus
Intro
History of Slack
Reliability Crisis
Incident Management Vision
Incident Management Plan
Incident Management Training
Severity Levels
Major IC
Major IC On-Call
Major IC Responsibility
Simultaneous incidents
Area Command
Long Duration Incidents
Pillar Incidents
What's Next
Ongoing Challenges
Recruitment and Training
Challenge of Success
Taught by
USENIX