- Module 1: Discover a map for navigating reliability challenges and sustainably achieving the appropriate level of reliability in your systems, services, and products.
- Express why reliability is crucial to your success
- Describe modern operations practices that offer tools you can use to work on your reliability challenges
- Explain the Dickerson hierarchy of reliability and the map it provides for approaching reliability challenges
- Module 2: Learn how to use monitoring to help you sustainably achieve the appropriate level of reliability in your systems, services, and products.
- Learn how to increase your operational awareness as a precursor to reliability work
- Expand your understanding of reliability itself
- Change the way you frame your thinking about monitoring to make it more impactful
- Gain a basic understanding of the applicable monitoring platform and tools available on Azure
- Learn a practice from site reliability engineering that can immediately start to create an impact on reliability
- Learn to craft actionable alerts to make your operational practices sustainable
- Module 3: Learn the incident response fundamentals necessary to help you sustainably achieve the appropriate level of reliability in your systems, services, and products.
- Learn the importance of effective incident response
- Gain an understanding of the lifecycle of an incident so you know where to apply your efforts
- Learn the building blocks for constructing an incident response process that allows you to respond with urgency
- Begin to track your incidents effectively using Azure DevOps tools
- Explore ways to automate your incident tracking for a speedy and consistent response
- Understand the guidelines around communication that allow incident response to be more efficient
- Explore some Azure tools that can significantly speed up your remediation times during an incident
- Module 4: Learn about post-incident reviews, a practice necessary to help you sustainably achieve the appropriate level of reliability in your systems, services, and products.
- Discover the importance of learning from incidents
- Understand the aspects of complex systems that make learning from failure important
- Learn when and how to conduct a post-incident review
- Understand the purpose and goals of a post-incident review
- Learn the components that go into a good post-incident review
- Explore the Azure tools that can assist with getting started with post-incident reviews
- Become aware of common traps to avoid
- Identify helpful practices to conduct a better review
- Module 5: Learn about deployment practices that can help you sustainably achieve the appropriate level of reliability in your systems, services, and products.
- Learn what software deployment is and the different kinds of deployments you might employ
- Discover the significant benefits of switching from an "epic deployment" model to a "continuous deployment" model
- Explore the components of continuous deployment
- Look deep into pipelines and how they are implemented in Azure Pipelines
- Learn a number of different strategies for deployment to production that can help you avoid incidents
- Examine some important best practices that can minimize the risk when rolling out new software or a new version of existing software
- Module 6: Learn about capacity planning and scaling practices that can help you sustainably achieve the appropriate level of reliability in your systems, services, and products.
- Learn about scalability and the scalability/reliability relationship
- Understand the role of capacity planning in preparing for growth
- Learn basic concepts and fundamental terms related to scaling
- Eliminate single points of failure
- Understand the different kinds of growth and how to respond to them
- Be able to measure capacity in the cloud
- Catch issues with service limits and quotas before they emerge using Azure tools
- Understand important steps to take before beginning work on scaling
- List techniques for making an application more scalable, including decoupling, queues, in-memory caching, and database sharding
- Learn about the Azure tools that make it possible to take your application or service global