Overview

In this course, you will learn how to harness the power of Apache Spark and powerful clusters running on the Azure Databricks platform to run data science workloads in the cloud. This is the fourth course in a five-course program that prepares you to take the DP-100: Designing and Implementing a Data Science Solution on Azurec ertification exam. The certification exam is an opportunity to prove knowledge and expertise operate machine learning solutions at a cloud-scale using Azure Machine Learning. This specialization teaches you to leverage your existing knowledge of Python and machine learning to manage data ingestion and preparation, model training and deployment, and machine learning solution monitoring in Microsoft Azure. Each course teaches you the concepts and skills that are measured by the exam. This Specialization is intended for data scientists with existing knowledge of Python and machine learning frameworks like Scikit-Learn, PyTorch, and Tensorflow, who want to build and operate machine learning solutions in the cloud. It teaches data scientists how to create end-to-end solutions in Microsoft Azure. Students will learn how to manage Azure resources for machine learning; run experiments and train models; deploy and operationalize machine learning solutions, and implement responsible machine learning. They will also learn to use Azure Databricks to explore, prepare, and model data; and integrate Databricks machine learning processes with Azure Machine Learning.

Syllabus

Introduction to Azure Databricks

In this module, you will discover the capabilities of Azure Databricks and the Apache Spark notebook for processing huge files. You will come to understand the Azure Databricks platform and identify the types of tasks well-suited for Apache Spark. You will also be introduced to the architecture of an Azure Databricks Spark Cluster and Spark Jobs.

Working with data in Azure Databricks

Azure Databricks supports day-to-day data-handling functions, such as reads, writes, and queries. In this module, you will work with large amounts of data from multiple sources in different raw formats. You will also learn to use the DataFrame Column Class Azure Databricks to apply column-level transformations, such as sorts, filters and aggregations. You will also use advanced DataFrame functions operations to manipulate data, apply aggregates, and perform date and time operations in Azure Databricks.

Processing data in Azure Databricks

Azure Databricks supports a range of built in SQL functions, however, sometimes you have to write custom function, known as User-Defined Function (UDF). In this module, you will learn how to register and invoke UDFs. You will also learn how to use Delta Lake to create, append, and upsert data to Apache Spark tables, taking advantage of built-in reliability and optimizations.

Get started with Databricks and machine learning

In this module, you will learn how to use PySpark’s machine learning package to build key components of the machine learning workflows that include exploratory data analysis, model training, and model evaluation. You will also learn how to build pipelines for common data featurization tasks.

Manage machine learning lifecycles and fine tune models

In this module, you will learn how to use MLflow to track machine learning experiments and how to use modules from the Spark’s machine learning library for hyperparameter tuning and model selection.

Train a distributed neural network and serve models with Azure Machine Learning

In this module, you will learn how to use the Uber’s Horovod framework along with the Petastorm library to run distributed, deep learning training jobs on Spark using training datasets in the Apache Parquet format. You will also learn how to use MLflow and Azure Machine Learning service register, package, and deploy a trained model to both Azure Container Instance, and Azure Kubernetes Service as a scoring web service.