Watching the Watchers: How We Do Continuous Reliability at Grafana Labs
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Learn about continuous reliability practices at Grafana Labs in this technical conference talk, which covers real-world challenges and solutions in maintaining observability tools. Explore how the company solved a mystery that cost more than $100,000, scaled Mimir clusters to handle 1.3 billion time series, and optimized Loki clusters to process 324 TB of logs per day. Gain insight into the internal monitoring dashboards used for Grafana Cloud and the lessons learned from production incidents and system failures. Through candid discussion of past challenges and current improvements, understand the practical aspects of implementing observability at scale and maintaining reliability in complex microservices-based systems.
Syllabus
Watching the Watchers: How We Do Continuous Reliability at Grafana Labs - Nicole van der Hoeven
Taught by
CNCF [Cloud Native Computing Foundation]