Overview
Explore a conference talk on Thunderbolt, a hardware-agnostic power capping system designed for hyperscale data centers. Learn about the challenges of power oversubscription and the need for task-level quality-of-service differentiation in modern compute clusters. Discover how Thunderbolt ensures safe power oversubscription while minimizing impact on both throughput-oriented and latency-sensitive tasks. Examine the system's architecture, mechanisms, and policies, including its two-threshold control policy and use of CPU bandwidth control. Understand the benefits of Thunderbolt's reactive and proactive capping approaches, and see real-world deployment results in production clusters. Gain insights into power efficiency improvements and the potential for significant power oversubscription gains in data center environments.
Syllabus
Intro
Motivation: power oversubscription and capping
Motivation: task QoS differentiation
Prior industry solutions did not meet our needs
Architecture
Mechanism and policy details
Why not RAPL or DVFS?
CPU bandwidth control, DVFS, RAPL on Intel Skylake CPU
Reactive capping policy: load shaping
Load shaping on a production cluster
Proactive capping mechanism: CPU jailing (deterministic machine CPU cap)
20% CPU jailing on a production cluster
Proactive capping policy: risk assessment
Deployed in logs processing clusters
Summary
Taught by
USENIX