Automatic Generation of Runtime Checkers for Production Distributed Systems
Strange Loop Conference via YouTube
Overview
Syllabus
Intro
Runtime checker (a.k.a. detector/monitor)
Importance of runtime checkers
Current checking practice
Complex internals of modern software
Common to exhibit gray failures
A real-world gray failure
Failure root cause
Ideal runtime checkers
A new approach
Panorama: capture in-situ observability
Convert a program into in-situ observer
Identify observation boundary and identities
Extract evidence
Example of analysis
Detecting real-world gray failures
Timeline of detecting failure case f1
Latency overhead to observers
Program reduction approach
Why do reduction?
Identify long-running regions
Select checking target candidates
Reduce long-running methods
Encapsulate checkers
Insert watchdog hooks
Prevent side effects
Watchdog generation
Failure detection evaluation setup
Detecting real-world failures
Silent semantic violations
Real-world failure study
Oathkeeper: detect silent semantic violations
How to express semantics?
Oathkeeper workflow
Emitting semantic event traces
General semantic rule templates
Extracted semantic rules
Runtime overhead
Conclusions
Taught by
Strange Loop Conference