The Day We Deleted Production - Kubernetes Infrastructure Recovery at CERN
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Explore a critical incident at CERN in which a maintenance tool accidentally deleted a third of the production capacity in minutes. Learn how the Kubernetes infrastructure at CERN, which runs workloads ranging from scientific computing to critical services for the campus and the physics accelerator complex, managed to avoid downtime and recover quickly. Discover the architecture for high service availability, strategies to reduce blast radius, the concept of "clusters as cattle," and the crucial role of GitOps in saving the day. Gain insights into lessons learned, including cyclic dependencies during major outage recovery and considerations for stateful workloads and multi-cluster scheduling. Watch a live demonstration of CERN services recovering from what would have been a severe event in the past, and understand how years of effort have resulted in calm user responses during major incidents.
Syllabus
The Day We Delete(d) Production - Ricardo Rocha & Spyridon Trigazis, CERN
Taught by
CNCF [Cloud Native Computing Foundation]