Develop a Site Reliability Engineering (SRE) strategy

Microsoft via Microsoft Learn

Details

Provider

Microsoft Learn

Pricing

Free Online Course

Languages

English

Duration & workload

7 hours 43 minutes

Sessions

On-Demand

Level

Beginner

Overview

Module 1: Learn about SRE, an engineering discipline that helps you sustainably achieve the appropriate level of reliability in your systems, services, and products.
In this module, you will:
- Gain a basic understanding of Site Reliability Engineering (SRE)
- Learn how to get started with this valuable operations practice
Module 2: Respond to incidents and activities in your infrastructure through alerting capabilities in Azure Monitor.
In this module, you'll:
- Configure alerts on events in your Azure resources based on metrics, log events, and activity log events.
- Learn how to use action groups in response to an alert, and how to use alert processing rules to override action groups when necessary.
Module 3: Learn how to capture trace output from your Azure web apps. View a live log stream and download log files for offline analysis.
In this module, you will:
- Enable application logging on an Azure Web App
- View live application logging activity with the log streaming service
- Retrieve application log files from an application with Kudu or the Azure CLI
Module 4: Learn how to manage site reliability.
After completing this module, you'll be able to:
- Describe how site reliability engineering (SRE) empowers software developers to own the ongoing daily operation of their applications in production.
- Describe how Application Insights analyzes the performance of your web application and can warn you about potential problems.
- List the processes that you can implement to monitor site reliability.
- Build a "just culture" that balances safety and accountability.
Module 5: Cloud Admin course from Dr. Majd Sakr at Carnegie Mellon University. Discover what cloud elasticity means and different ways to scale your cloud resources.
In this module, you will:
- Describe common load patterns and how they drive the need to scale
- Enumerate the strategies and considerations in scaling cloud applications
- Discuss the advantages of auto-scaling and the mechanisms used to achieve it
- Describe the importance of load balancing in cloud applications and enumerate various methods to achieve it
- List the primary benefits of serverless computing and explain the concept of serverless functions
This content is provided in partnership with Dr. Majd Sakr and Carnegie Mellon University.
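As a rough illustration of the load-balancing objective above, round-robin is one of the simplest distribution methods: requests are handed to each backend in turn. The sketch below is not taken from the course; the function name `round_robin` and the backend names are hypothetical.

```python
from itertools import cycle


def round_robin(backends):
    """Return a picker that hands out backends in a fixed rotation."""
    rotation = cycle(backends)  # endless iterator over the backend list
    return lambda: next(rotation)


# Hypothetical backend names, for illustration only.
pick = round_robin(["vm-a", "vm-b", "vm-c"])

# Five incoming requests are spread across the three backends in order.
assignments = [pick() for _ in range(5)]
print(assignments)
```

Round-robin ignores how busy each backend is, which is why real load balancers often combine it with health probes or load-aware policies.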
Module 6: Carnegie Mellon University's Cloud Developer course. Learn how developers write programs that run on the cloud, including how to deploy, be fault-tolerant, load balance, scale, and deal with latency.
In this module, you will:
- Evaluate different considerations when programming applications that run on clouds
- Evaluate different considerations when deploying applications on clouds
- Compare and contrast proactive and reactive measures for fault tolerance in cloud applications
- Describe the importance of load balancing in cloud applications and enumerate various methods to achieve it
- Enumerate the strategies and considerations in scaling cloud applications
- Motivate the case for minimizing tail latency and discuss the various strategies to reduce tail latency
- Describe the strategies to optimize total operational cost of using cloud services
In partnership with Dr. Majd Sakr and Carnegie Mellon University.
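To make the tail-latency objective above concrete, here is a minimal sketch, assuming the nearest-rank definition of a percentile, of why averages and medians can hide slow requests. The sample latencies and the helper name `percentile` are made up for illustration and do not come from the course.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))  # 1-based rank
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds; one request is very slow.
latencies_ms = [12, 15, 11, 14, 13, 16, 12, 15, 14, 950]

p50 = percentile(latencies_ms, 50)  # typical request
p99 = percentile(latencies_ms, 99)  # the tail
print(f"p50={p50} ms, p99={p99} ms")
```

In this sample the median stays low while the 99th percentile is dominated by the single slow request, which is the gap that tail-latency strategies aim to shrink.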
Module 7: Learn how to troubleshoot inbound network connectivity for Azure Load Balancer.
In this module, you will:
- Identify common Azure Load Balancer inbound connectivity issues.
- Identify steps to resolve issues when virtual machines aren't responding to health probes.
Module 8: Learn how to monitor the health of your Azure VMs by using Azure Metrics Explorer and metric alerts.
In this module, you will:
- Identify metrics and diagnostic data that you can collect for virtual machines
- Configure monitoring for a virtual machine
- Use monitoring data to diagnose problems

Syllabus

Module 1: Introduction to Site Reliability Engineering (SRE)
- Introduction to Site Reliability Engineering
- What is SRE and why does it matter?
- SRE in context
- Key SRE principles and practices: virtuous cycles
- Key SRE principles and practices: The human side of SRE
- Getting started with SRE
- Summary
Module 2: Improve incident response with alerting on Azure
- Introduction
- Explore the different alert types that Azure Monitor supports
- Use metric alerts for alerts about performance issues in your Azure environment
- Exercise - Use metric alerts to alert on performance issues in your Azure environment
- Use log alerts to alert on events in your application
- Use activity log alerts to alert on events within your Azure infrastructure
- Use action groups and alert processing rules to send notifications when an alert is fired
- Exercise - Use an activity log alert and an action group to notify users about events in your Azure infrastructure
- Summary
Module 3: Capture Web Application Logs with App Service Diagnostics Logging
- Introduction
- Enable and configure App Service application logging
- Exercise - Enable and configure App Service application logging using the Azure portal
- View live application logging with the log streaming service
- Exercise - View live application logging with the log streaming service using Azure CLI
- Retrieve application log files
- Exercise - Retrieve Application Log Files using Azure CLI and Kudu
- Summary
Module 4: Manage site reliability
- Introduction
- What is reliability engineering?
- What is Application Insights?
- Perform ongoing tuning to reduce meaningless alerts
- Analyze alerts to establish a baseline
- Blameless postmortems
- Knowledge check
- Summary
Module 5: Scale your cloud resources with elasticity
- Introduction
- Compute load patterns
- Scaling compute resources
- Automated scaling on the cloud
- Load balancing
- Serverless computing
- Summary
Module 6: Build applications on the cloud
- Introduction
- Programming the cloud
- Deploy applications on the cloud
- Build fault-tolerant cloud services
- Load balancing
- Scale resources
- How to deal with tail latency
- Economics for cloud applications
- Summary
Module 7: Troubleshoot inbound network connectivity for Azure Load Balancer
- Introduction
- Troubleshoot Azure Load Balancer
- Diagnose issues by reviewing configurations and metrics
- Exercise - Set up your environment
- Exercise - Identify and resolve inbound network connectivity
- Summary
Module 8: Monitor the health of your Azure virtual machine by using Azure Metrics Explorer and metric alerts
- Introduction
- Monitor the health of the virtual machine
- Exercise - Set up a VM with boot diagnostics
- View VM metrics
- Configure the Azure Diagnostics extension
- Exercise - Configure the Azure Diagnostics extension
- Diagnostic data case studies
- Exercise - Use diagnostic data
- Summary