Overview
Learn about Gandalf, an end-to-end analytics service developed at Microsoft Azure for safe deployment in large-scale cloud infrastructure. Gandalf enables rapid and robust impact assessment of software rollouts so that a bad update is caught before it causes a widespread outage. Discover how it monitors and analyzes a range of fault signals and correlates them with ongoing rollouts using spatial and temporal algorithms. Understand the core decision logic, including an ensemble ranking algorithm and a binary classifier, which determines whether a rollout is safe. Gain insight into Gandalf's lambda architecture, which provides both real-time and long-term deployment monitoring with automated decisions and notifications. Examine the results from Microsoft Azure's production environment, where Gandalf achieved high precision and recall for both data-plane and control-plane rollouts. This conference talk from NSDI '20 is aimed at professionals working on large-scale cloud systems and deployment safety.
Syllabus
NSDI '20 - Gandalf: An Intelligent, End-To-End Analytics Service for Safe Deployment in Large-Scale Cloud Infrastructure
Taught by
USENIX