How the OOM-Killer Deleted My Namespace, and Other Kubernetes Tales
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Syllabus
Intro
Datadog
Symptoms
Investigation
Deletion call, 4d before
Audit logs for the namespace
Spinnaker deploys (v1)
Helm 3 deploys (v2)
Big difference
What happened?
Namespace Controller logs Virtual
Events so far
Metrics-server setup
Metrics-server deployment
Full chain of events
Key take-away: APIService extensions are great but can impact your cluster
Context
Runtime is down?
CNI status
Containerd goroutine dump
Blocked goroutines?
Seems CNI related
What about Delete?
CNI plugin
The root cause
What we know
Apiserver requests
Illustration
What about label filters?
Informers instead of List
How do informers work?
Back to the incident
Nodegroup controller?
How did it work?
What we learned
Conclusion
Taught by
CNCF [Cloud Native Computing Foundation]