Handling Streaming Data with Azure Databricks Using Spark Structured Streaming

Overview

In this course, you will dive deep into Spark Structured Streaming, see its features in action, and use it to build complex, reliable, end-to-end streaming pipelines with PySpark. You will build and run these pipelines on the Azure Databricks platform.

Modern data pipelines often include streaming data that needs to be processed in real time. In practice, you have to work with multiple streams and datasets to produce results continuously. In this course, Handling Streaming Data with Azure Databricks Using Spark Structured Streaming, you will learn how to use Spark Structured Streaming on the Databricks platform running on Microsoft Azure, and how to leverage its features to build end-to-end streaming pipelines. First, you will get a quick recap of the Spark Structured Streaming processing model, understand the scenario you will implement, and complete the environment setup. Next, you will learn how to configure sources and sinks and build each phase of the streaming pipeline: extracting data from various sources, transforming it, and loading it into multiple sinks, namely Azure Data Lake, Azure Event Hubs, and Azure SQL. You will also see the different timestamps associated with an event and how to aggregate data using windows. Then, you will see how to combine a stream with static or historical datasets, and how to combine multiple streams. Finally, you will learn how to build a production-ready pipeline, schedule it as a job in Databricks, and manage such jobs using the Databricks CLI. When you are finished with this course, you will be comfortable building complex streaming pipelines on Azure Databricks to solve a variety of business problems.