Improve your reliability with modern operations practices


Module 1: Discover a map for navigating reliability challenges and sustainably achieving the appropriate level of reliability in your systems, services, and products.

By the end of this module, you will be able to:

Express why reliability is crucial to your success
Describe modern operations practices that offer tools you can use to work on your reliability challenges
Explain the Dickerson hierarchy of reliability and the map it provides for approaching reliability challenges

Module 2: Learn how to use monitoring to help you sustainably achieve the appropriate level of reliability in your systems, services, and products.

In this module you will:

Learn how to increase your operational awareness as a precursor to reliability work
Expand your understanding of reliability itself
Change the way you frame your thinking about monitoring to make it more impactful
Gain a basic understanding of the applicable monitoring platform and tools available on Azure
Learn a practice from site reliability engineering that can immediately start to create an impact on reliability
Learn to craft actionable alerts to make your operational practices sustainable

Module 3: Learn the incident response fundamentals necessary to help you sustainably achieve the appropriate level of reliability in your systems, services, and products.

In this module you will:

Learn the importance of effective incident response
Gain an understanding of the lifecycle of an incident so you know where to apply your efforts
Learn the building blocks for constructing an incident response process that allows you to respond with urgency
Begin to track your incidents effectively using Azure DevOps tools
Explore ways to automate your incident tracking for a speedy and consistent response
Understand the guidelines around communication that allow incident response to be more efficient
Visit some Azure tools that can significantly speed up your remediation times during an incident

Module 4: Learn about post-incident reviews, a practice necessary to help you sustainably achieve the appropriate level of reliability in your systems, services, and products.

In this module you will:

Discover the importance of learning from incidents
Understand the aspects of complex systems that make learning from failure important
Learn when and how to conduct a post-incident review
Understand the purpose and goals of a post-incident review
Learn the components that go into a good post-incident review
Explore the Azure tools that can assist with getting started with post-incident reviews
Become aware of common traps to avoid
Identify helpful practices to conduct a better review

Module 5: Learn about deployment practices that can help you sustainably achieve the appropriate level of reliability in your systems, services, and products.

In this module you will:

Learn what software deployment is and the different kinds of deployments you might employ
Discover the significant benefits of switching from an "epic deployment" model to a "continuous deployment" model
Explore the components of continuous deployment
Look deep into pipelines and how they are implemented in Azure Pipelines
Learn several strategies for deploying to production that can help you avoid incidents
Examine some important best practices that can minimize the risk when rolling out new software or a new version of existing software

Module 6: Learn about capacity planning and scaling practices that can help you sustainably achieve the appropriate level of reliability in your systems, services, and products.

In this module you will:

Learn about scalability and the scalability/reliability relationship
Understand the role of capacity planning in preparing for growth
Learn basic concepts and fundamental terms related to scaling
Eliminate single points of failure
Understand the different kinds of growth and how to respond to them
Be able to measure capacity in the cloud
Catch issues with service limits and quotas before they emerge using Azure tools
Understand important steps to take before beginning work on scaling
List techniques for making an application more scalable, including decoupling, queues, in-memory caching, and database sharding
Learn about the Azure tools that make it possible to take your application or service global