Time for Chaos - Understanding Chaos Engineering for Infrastructure Resilience

Overview

Explore Chaos Engineering, a branch of Site Reliability Engineering (SRE), in this 25-minute video presentation from Google Cloud Tech. Learn how to proactively test infrastructure resilience and reliability by simulating failures and injecting faults. Speaker Dharmesh Vaya walks through the average cost of infrastructure failures and the phases of Chaos: establishing a steady state, formulating a hypothesis, and reaching the end goal. Discover the fundamental aspects of Chaos Engineering platforms, including how to design Chaos Workflows and simulate random pod deletion, network traffic degradation, and disk fill scenarios. Watch a demonstration of popular Chaos Engineering tools such as Gremlin and Litmus running on Google Kubernetes Engine. Gain insight into the role of observability in Chaos Engineering and the key metrics used to determine an application's resiliency score, and finish with best practices for implementing Chaos Engineering in your own systems.
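
To make the phases named above more concrete, here is a minimal sketch of a random pod deletion experiment written as plain Python. It is not taken from the video and does not use Gremlin, Litmus, or any Kubernetes API. The `Service` class, the replica names, and the `min_ready` threshold are all hypothetical stand-ins chosen for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Service:
    """Hypothetical toy service: a list of replica names behind one endpoint."""
    replicas: list

    def is_healthy(self, min_ready: int) -> bool:
        # Steady state here means "at least min_ready replicas are running".
        return len(self.replicas) >= min_ready


def delete_random_pod(service: Service, rng: random.Random) -> str:
    """Fault injection: remove one replica chosen at random."""
    victim = rng.choice(service.replicas)
    service.replicas.remove(victim)
    return victim


def run_experiment(service: Service, min_ready: int, rng: random.Random) -> dict:
    # Phase 1: confirm the steady state before injecting anything.
    if not service.is_healthy(min_ready):
        return {"status": "aborted", "deleted": None}
    # Phase 2: hypothesis, "the service stays healthy after losing one pod".
    deleted = delete_random_pod(service, rng)
    # Phase 3: compare the observed state with the hypothesis.
    held = service.is_healthy(min_ready)
    return {"status": "passed" if held else "failed", "deleted": deleted}


if __name__ == "__main__":
    svc = Service(replicas=["web-0", "web-1", "web-2"])
    result = run_experiment(svc, min_ready=2, rng=random.Random(0))
    print(result["status"], "after deleting", result["deleted"])
```

In a real cluster the health check would come from observability data and the fault would be injected by a chaos platform; the sketch only shows the order of the steps.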

Syllabus

Introduction
Average cost of Infra failures
Solution
Chaos Engineering
What is Chaos
Phases of Chaos
Steady State
Hypothesis
End Goal
Platforms
Experiment
Scenarios
Demo
Best Practices