A Story of Managing Kubernetes Cluster with 15k Nodes and Various Workloads
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Explore managing a massive Kubernetes cluster with 15,000 nodes and diverse workloads in this conference talk from KubeCon + CloudNativeCon Europe 2022. Discover how Ant Group evolved their cluster to accommodate a large-scale mix of long-running workloads and offline big data/ML training jobs. Learn about performance optimization techniques, including apiserver improvements, client-side enhancements, and etcd performance considerations. Gain insights into identifying bottlenecks through traffic analysis, simulation, and testing. Understand methods for reducing response times for list, create, update, and delete operations to meet SLO requirements. Explore topics such as load simulation, SLO-based performance measurement, live upgrading, webhook performance, and controller performance. Benefit from real-world experiences in managing one of the largest Kubernetes clusters in the community.
Syllabus
Intro
Large scale K8s clusters
How do we get there?
Load Simulation
SLO-based performance measurement
Performance tuning
Etcd Performance
Live upgrading
Apiserver Intro
Apiserver Performance
Webhook Performance
Controller Performance
Apiserver related results
Client RT Results
Taught by
CNCF [Cloud Native Computing Foundation]