A Story of Managing Kubernetes Cluster with 15k Nodes and Various Workloads

Overview

Explore managing a massive Kubernetes cluster with 15,000 nodes and diverse workloads in this conference talk from KubeCon + CloudNativeCon Europe 2022. Discover how Ant Group evolved their cluster to accommodate a large-scale mix of long-running workloads and offline big data/ML training jobs. Learn about performance optimization techniques, including apiserver improvements, client-side enhancements, and etcd performance considerations. Gain insights into identifying bottlenecks through traffic analysis, simulation, and testing. Understand methods for reducing response times for list, create, update, and delete operations to meet SLO requirements. Explore topics such as load simulation, SLO-based performance measurement, live upgrading, webhook performance, and controller performance. Benefit from real-world experiences in managing one of the largest Kubernetes clusters in the community.

Syllabus

Intro
Large scale K8s clusters
How do we get there?
Load Simulation
SLO-based performance measurement
Performance tuning
Etcd Performance
Live upgrading
Apiserver Intro
Apiserver Performance
Webhook Performance
Controller Performance
Apiserver related results
Client RT Results

Taught by

CNCF [Cloud Native Computing Foundation]
