Overview
Explore a presentation on NetBouncer, an active failure localization system for data center networks. Learn how it uses an IP-in-IP technique to probe the network and detect both device and link failures, helping keep data center services highly available. Discover why accurately localizing failures among millions of servers and network devices is hard, and how NetBouncer's algorithm combines troubleshooting domain knowledge with machine learning to cope with inconsistencies in real-world data. Gain insights into the system's deployment in Microsoft Azure's data centers, its performance in detecting spine router gray failures, and its negligible overhead on the server side. The talk covers active probing, path selection, device failure detection, and link failure inference as parts of a framework for maintaining data center network reliability.
Syllabus
Intro
This is a true story
Active probing system requires explicit and efficient probing
Observation vs. inference: from path probing to failures
Real-world constraints complicate path selection
Device failure detection
Link failure inference: an optimization problem
Real-world data inconsistency induces false positives
Evaluation questions
Real cases: spine router gray failure
Accuracy comparison with previous algorithms
NetBouncer algorithm performance
NetBouncer has negligible overheads on the server side
Taught by
USENIX