Confessions of a Systems Engineer - Learning from My 20+ Years of Failure

USENIX via YouTube

Overview

Explore insights from over two decades of systems engineering experience in this 39-minute SREcon conversation with David Argent from Amazon. Gain lessons learned from failures in designing and running large-scale online services. Discover key concepts such as minimizing change impact, implementing thorough monitoring, automating mitigations, and designing for quick incident resolution. Learn about the importance of exercising processes regularly, enforcing processes with technology, and understanding all supported scenarios. Benefit from Argent's diverse background, which spans roles such as Technical Writer, Systems Engineer, and Lead Problem Engineer at companies including Microsoft and Amazon.

Syllabus

Intro
There Are No Safe Changes
Minimize the Blast Radius on Changes
Monitor Accurately and Measure Thoroughly
Automate Mitigations
Degraded Service Modes, or An Imperfect Experience Usually Beats a Nonexistent One
Use Functional Gates Pre-, Post- and During Releases
Design to Meet SLAs and Mitigate Incidents Quickly
Regularly Exercise All Processes and Tools
Enforce Processes with Technology
Redirect or Drop Traffic Aggressively During Incidents
Production Quality Tools
Sanitize and Verify Inputs
Understand All of the Scenarios You Support
Transition Service Responsibilities Carefully

Taught by

USENIX
