Overview
Explore how tracing uncovers half-truths in Slack's CI infrastructure in this 23-minute conference talk from Strange Loop. Discover why traditional monitoring tools like logs and metrics were insufficient for debugging CI system failures. Learn how traces made it possible to understand where faults occur in interconnected systems such as GitHub Enterprise (GHE), Checkpoint, and Cypress. Gain insights into shared tooling for high-dimensionality event traces using SlackTrace and SpanEvents, and how that tooling sped up diagnosing code problems and debugging complex system interactions. Follow the journey from the early incidents that motivated investment in internal tooling to improvements in performance and resiliency across Slack's infrastructure. Topics include developer productivity, span event structure, shared dimensions, use cases, fuzzy service boundaries, incident command systems, and testing changes.
Syllabus
Intro
Developer Productivity
Span Event Structure
What's Next
Shared Dimensions
Use Cases
The Root Challenge
The Results
Fuzzy Service Boundaries
Incident Command System
Testing Changes
Summary
Taught by
Strange Loop Conference