

The Day We Deleted Production - Kubernetes Infrastructure Recovery at CERN

CNCF [Cloud Native Computing Foundation] via YouTube

Overview

Explore a critical incident at CERN in which a maintenance tool accidentally deleted a third of the production capacity within minutes. Learn how CERN's Kubernetes infrastructure, which runs workloads ranging from scientific computing to critical services for the campus and the physics accelerator complex, avoided downtime and recovered quickly. Discover the architecture behind high service availability, strategies for reducing blast radius, the concept of "clusters as cattle," and the crucial role GitOps played in saving the day. Gain insight into the lessons learned, including cyclic dependencies during recovery from a major outage and considerations for stateful workloads and multi-cluster scheduling. Watch a live demonstration of CERN services recovering from what would once have been a severe event, and see how years of effort have led to users responding calmly during major incidents.

Syllabus

The Day We Delete(d) Production - Ricardo Rocha & Spyridon Trigazis, CERN

Taught by

CNCF [Cloud Native Computing Foundation]
