What you'll learn:
- Data Engineering leveraging Databricks features
- Databricks CLI to manage files, Data Engineering jobs and clusters for Data Engineering Pipelines
- Deploying Data Engineering applications developed using PySpark on job clusters
- Deploying Data Engineering applications developed using PySpark using Notebooks on job clusters
- Perform CRUD Operations leveraging Delta Lake using Spark SQL for Data Engineering Applications or Pipelines
- Perform CRUD Operations leveraging Delta Lake using PySpark for Data Engineering Applications or Pipelines
- Setting up development environment to develop Data Engineering applications using Databricks
- Building Data Engineering Pipelines using Spark Structured Streaming on Databricks Clusters
- Incremental File Processing using Spark Structured Streaming leveraging Databricks Auto Loader cloudFiles
- Overview of Auto Loader cloudFiles File Discovery Modes - Directory Listing and File Notifications
- Differences between Auto Loader cloudFiles File Discovery Modes - Directory Listing and File Notifications
- Differences between traditional Spark Structured Streaming and leveraging Databricks Auto Loader cloudFiles for incremental file processing.
As part of this course, you will learn Data Engineering using Databricks, a cloud platform-agnostic technology.
About Data Engineering
Data Engineering is the processing of data to meet downstream needs. As part of Data Engineering, we build different kinds of pipelines, such as Batch Pipelines and Streaming Pipelines. All roles related to Data Processing are consolidated under Data Engineering. Conventionally, these roles are known as ETL Development, Data Warehouse Development, etc.
About Databricks
Databricks is one of the most popular cloud platform-agnostic data engineering tech stacks. The company was founded by the original creators of Apache Spark and continues to contribute to the project. The Databricks Runtime provides Spark while leveraging the elasticity of the cloud. With Databricks, you pay for what you use. Over time, Databricks introduced the Lakehouse, which provides the features required for traditional BI as well as AI & ML. Here are some of the core features of Databricks.
Spark - Distributed Computing
Delta Lake - Perform CRUD Operations. It is primarily used to insert, update, and delete data in the files of a Data Lake.
cloudFiles - Get the files in an incremental fashion in the most efficient way leveraging cloud features.
Databricks SQL - A Photon-based interface that is fine-tuned for running queries submitted for reporting and visualization by reporting tools. It is also used for Ad-hoc Analysis.
Course Details
As part of this course, you will be learning Data Engineering using Databricks.
Getting Started with Databricks
Setup Local Development Environment to develop Data Engineering Applications using Databricks
Using Databricks CLI to manage files, jobs, clusters, etc related to Data Engineering Applications
Spark Application Development Cycle to build Data Engineering Applications
Databricks Jobs and Clusters
Deploy and Run Data Engineering Jobs on Databricks Job Clusters as Python Application
Deploy and Run Data Engineering Jobs on Databricks Job Clusters using Notebooks
Deep Dive into Delta Lake using Dataframes on Databricks Platform
Deep Dive into Delta Lake using Spark SQL on Databricks Platform
Building Data Engineering Pipelines using Spark Structured Streaming on Databricks Clusters
Incremental File Processing using Spark Structured Streaming leveraging Databricks Auto Loader cloudFiles
Overview of Auto Loader cloudFiles File Discovery Modes - Directory Listing and File Notifications
Differences between Auto Loader cloudFiles File Discovery Modes - Directory Listing and File Notifications
Differences between traditional Spark Structured Streaming and leveraging Databricks Auto Loader cloudFiles for incremental file processing.
Overview of Databricks SQL for Data Analysis and reporting.
We will be adding a few more modules related to PySpark, Spark with Scala, Spark SQL, and Streaming Pipelines in the coming weeks.
Desired Audience
Here is the desired audience for this advanced course.
Experienced application developers with prior knowledge and experience of Spark who want to gain expertise in Data Engineering.
Experienced Data Engineers to gain enough skills to add Databricks to their profile.
Testers to improve their testing capabilities related to Data Engineering applications using Databricks.
Prerequisites
Logistics
Computer with decent configuration (at least 4 GB RAM; 8 GB is highly desired)
Dual Core is required and Quad-Core is highly desired
Chrome Browser
High-Speed Internet
Valid AWS Account
Valid Databricks Account (free Databricks Account is not sufficient)
Experience as Data Engineer especially using Apache Spark
Knowledge of cloud concepts such as storage, users, roles, etc.
Associated Costs
As part of the training, you will only get the material. You need to practice on your own or corporate cloud account and Databricks Account.
You need to take care of the associated AWS or Azure costs.
You need to take care of the associated Databricks costs.
Training Approach
Here are the details related to the training approach.
It is self-paced with reference material, code snippets, and videos provided as part of Udemy.
One needs to sign up for their own Databricks environment to practice all the core features of Databricks.
We would recommend completing 2 modules every week by spending 4 to 5 hours per week.
It is highly recommended to complete all the tasks so that one can get real experience with Databricks.
Support will be provided through Udemy Q&A.
Here is the detailed course outline.
Getting Started with Databricks on Azure
As part of this section, we will go through the details of signing up for Azure and setting up the Databricks cluster on Azure.
Getting Started with Databricks on Azure
Signup for the Azure Account
Login and Increase Quotas for regional vCPUs in Azure
Create Azure Databricks Workspace
Launching Azure Databricks Workspace or Cluster
Quick Walkthrough of Azure Databricks UI
Create Azure Databricks Single Node Cluster
Upload Data using Azure Databricks UI
Overview of Creating Notebook and Validating Files using Azure Databricks
Develop Spark Application using Azure Databricks Notebook
Validate Spark Jobs using Azure Databricks Notebook
Export and Import of Azure Databricks Notebooks
Terminating Azure Databricks Cluster and Deleting Configuration
Delete Azure Databricks Workspace by deleting Resource Group
Azure Essentials for Databricks - Azure CLI
As part of this section, we will go through the details of setting up the Azure CLI to manage Azure resources using relevant commands.
Azure Essentials for Databricks - Azure CLI
Azure CLI using Azure Portal Cloud Shell
Getting Started with Azure CLI on Mac
Getting Started with Azure CLI on Windows
Warming up with Azure CLI - Overview
Create Resource Group using Azure CLI
Create ADLS Storage Account within Resource Group
Add Container as part of Storage Account
Overview of Uploading the data into ADLS File System or Container
Setup Data Set locally to upload into ADLS File System or Container
Upload local directory into Azure ADLS File System or Container
Delete Azure ADLS Storage Account using Azure CLI
Delete Azure Resource Group using Azure CLI
Mount ADLS on to Azure Databricks to access files from Azure Blob Storage
As part of this section, we will go through the details related to mounting Azure Data Lake Storage (ADLS) on to Azure Databricks Clusters.
Mount ADLS on to Azure Databricks - Introduction
Ensure Azure Databricks Workspace
Setup Databricks CLI on Mac or Windows using Python Virtual Environment
Configure Databricks CLI for new Azure Databricks Workspace
Register an Azure Active Directory Application
Create Databricks Secret for AD Application Client Secret
Create ADLS Storage Account
Assign IAM Role on Storage Account to Azure AD Application
Setup Retail DB Dataset
Create ADLS Container or File System and Upload Data
Start Databricks Cluster to mount ADLS
Mount ADLS Storage Account on to Azure Databricks
Validate ADLS Mount Point on Azure Databricks Clusters
Unmount the mount point from Databricks
Delete Azure Resource Group used for Mounting ADLS on to Azure Databricks
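The mounting steps in this section can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the course's own code: the helper names `adls_oauth_configs` and `mount_adls` and every argument value are hypothetical placeholders. It assumes the Azure AD application (service principal) flow listed above and a Databricks notebook where `dbutils` is available.

```python
from typing import Dict


def adls_oauth_configs(client_id: str, client_secret: str, tenant_id: str) -> Dict[str, str]:
    """Build the ABFS OAuth settings for an Azure AD application (hypothetical helper)."""
    return {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


def mount_adls(dbutils, container: str, storage_account: str,
               mount_point: str, configs: Dict[str, str]) -> str:
    """Mount an ADLS container; `dbutils` exists only inside Databricks."""
    source = f"abfss://{container}@{storage_account}.dfs.core.windows.net/"
    dbutils.fs.mount(source=source, mount_point=mount_point, extra_configs=configs)
    return source
```

On a cluster, the client secret would typically come from the Databricks secret created earlier, for example via `dbutils.secrets.get(scope=..., key=...)`, rather than being hard-coded. The mount can later be removed with `dbutils.fs.unmount(mount_point)`.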
Setup Local Development Environment for Databricks
As part of this section, we will go through the details related to setting up a local development environment for Databricks using tools such as PyCharm, Databricks Connect, Databricks dbutils, etc.
Setup Single Node Databricks Cluster
Install Databricks Connect
Configure Databricks Connect
Integrating PyCharm with Databricks Connect
Integrate Databricks Cluster with Glue Catalog
Setup AWS S3 Bucket and Grant Permissions
Mounting S3 Buckets into Databricks Clusters
Using Databricks dbutils from IDEs such as PyCharm
Using Databricks CLI
As part of this section, we will get an overview of Databricks CLI to interact with Databricks File System or DBFS.
Introduction to Databricks CLI
Install and Configure Databricks CLI
Interacting with Databricks File System using Databricks CLI
Getting Databricks Cluster Details using Databricks CLI
Databricks Jobs and Clusters
As part of this section, we will go through the details related to Databricks Jobs and Clusters.
Introduction to Databricks Jobs and Clusters
Creating Pools in Databricks Platform
Create Cluster on Azure Databricks
Request to Increase CPU Quota on Azure
Creating Job on Databricks
Submitting Jobs using Databricks Job Cluster
Create Pool in Databricks
Running Job using Interactive Databricks Cluster Attached to Pool
Running Job Using Databricks Job Cluster Attached to Pool
Exercise - Submit the application as a job using Databricks interactive cluster
Deploy and Run Spark Applications on Databricks
As part of this section, we will go through the details related to deploying Spark Applications on Databricks Clusters and also running those applications.
Prepare PyCharm for Databricks
Prepare Data Sets
Move files to ghactivity
Refactor Code for Databricks
Validating Data using Databricks
Setup Data Set for Production Deployment
Access File Metadata using Databricks dbutils
Build Deployable bundle for Databricks
Running Jobs using Databricks Web UI
Get Job and Run Details using Databricks CLI
Submitting Databricks Jobs using CLI
Setup and Validate Databricks Client Library
Resetting the Job using Databricks Jobs API
Run Databricks Job programmatically using Python
Detailed Validation of Data using Databricks Notebooks
Deploy and Run Spark Jobs using Notebooks
As part of this section, we will go through the details related to deploying Spark Applications on Databricks Clusters and also running those applications using Databricks Notebooks.
Modularizing Databricks Notebooks
Running Job using Databricks Notebook
Refactor application as Databricks Notebooks
Run Notebook using Databricks Development Cluster
Deep Dive into Delta Lake using Spark Data Frames on Databricks
As part of this section, we will go through all the important details related to Databricks Delta Lake using Spark Data Frames.
Introduction to Delta Lake using Spark Data Frames on Databricks
Creating Spark Data Frames for Delta Lake on Databricks
Writing Spark Data Frame using Delta Format on Databricks
Updating Existing Data using Delta Format on Databricks
Delete Existing Data using Delta Format on Databricks
Merge or Upsert Data using Delta Format on Databricks
Deleting using Merge in Delta Lake on Databricks
Point in Snapshot Recovery using Delta Logs on Databricks
Deleting unnecessary Delta Files using Vacuum on Databricks
Compaction of Delta Lake Files on Databricks
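The merge (upsert) lecture above can be illustrated with a short sketch that uses the Delta Lake Python API. It is a sketch under stated assumptions rather than the course's code: the helper names `merge_condition` and `upsert_to_delta` are hypothetical, and the table is assumed to be stored at a path and matched on one or more key columns.

```python
from typing import List


def merge_condition(keys: List[str], target_alias: str = "t",
                    source_alias: str = "s") -> str:
    """Build the condition that matches source rows to target rows."""
    return " AND ".join(f"{target_alias}.{k} = {source_alias}.{k}" for k in keys)


def upsert_to_delta(spark, target_path: str, source_df, keys: List[str]) -> None:
    """Merge `source_df` into the Delta table stored at `target_path`."""
    # Imported lazily so merge_condition can be used without Delta installed.
    from delta.tables import DeltaTable

    target = DeltaTable.forPath(spark, target_path)
    (
        target.alias("t")
        .merge(source_df.alias("s"), merge_condition(keys))
        .whenMatchedUpdateAll()      # update rows that already exist
        .whenNotMatchedInsertAll()   # insert rows that are new
        .execute()
    )
```

Rows that match on the key columns are updated and the remaining source rows are inserted, which is the upsert behavior the section describes.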
Deep Dive into Delta Lake using Spark SQL on Databricks
As part of this section, we will go through all the important details related to Databricks Delta Lake using Spark SQL.
Introduction to Delta Lake using Spark SQL on Databricks
Create Delta Lake Table using Spark SQL on Databricks
Insert Data to Delta Lake Table using Spark SQL on Databricks
Update Data in Delta Lake Table using Spark SQL on Databricks
Delete Data from Delta Lake Table using Spark SQL on Databricks
Merge or Upsert Data into Delta Lake Table using Spark SQL on Databricks
Using Merge Function over Delta Lake Table using Spark SQL on Databricks
Point in Snapshot Recovery using Delta Lake Table using Spark SQL on Databricks
Vacuuming Delta Lake Tables using Spark SQL on Databricks
Compaction of Delta Lake Tables using Spark SQL on Databricks
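The Spark SQL operations listed in this section can be sketched as statement builders in Python. This is only an illustration: the function names and the table names used in the usage note are hypothetical, and the statements assume Delta tables registered in the metastore on Databricks.

```python
from typing import List


def merge_sql(target: str, source: str, key: str) -> str:
    """Build a MERGE (upsert) statement matching on a single key column."""
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON t.{key} = s.{key}\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


def time_travel_sql(table: str, version: int) -> str:
    """Query an earlier snapshot of the table by version number."""
    return f"SELECT * FROM {table} VERSION AS OF {version}"


def maintenance_sql(table: str, retain_hours: int = 168) -> List[str]:
    """Compact small files, then remove files older than the retention window."""
    return [
        f"OPTIMIZE {table}",
        f"VACUUM {table} RETAIN {retain_hours} HOURS",
    ]
```

In a notebook, each statement could be passed to `spark.sql(...)`, for example `spark.sql(merge_sql("orders", "orders_updates", "order_id"))`. Note that vacuuming removes files that older snapshots depend on, so versions older than the retention window can no longer be queried afterwards.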
Accessing Databricks Cluster Terminal via Web as well as SSH
As part of this section, we will see how to access the terminal of a Databricks Cluster via Web as well as SSH.
Enable Web Terminal in Databricks Admin Console
Launch Web Terminal for Databricks Cluster
Setup SSH for the Databricks Cluster Driver Node
Validate SSH Connectivity to the Databricks Driver Node on AWS
Limitations of SSH and comparison with Web Terminal related to Databricks Clusters
Installing Software on Databricks Clusters using init scripts
As part of this section, we will see how to bootstrap Databricks clusters by installing relevant third-party libraries for our applications.
Setup gen_logs on Databricks Cluster
Overview of Init Scripts for Databricks Clusters
Create Script to install software from git on Databricks Cluster
Copy init script to dbfs location
Create Databricks Standalone Cluster with init script
Quick Recap of Spark Structured Streaming
As part of this section, we will get a quick recap of Spark Structured Streaming.
Validate Netcat on Databricks Driver Node
Push log messages to Netcat Webserver on Databricks Driver Node
Reading Web Server logs using Spark Structured Streaming
Writing Streaming Data to Files
Incremental Loads using Spark Structured Streaming on Databricks
As part of this section, we will understand how to perform incremental loads using Spark Structured Streaming on Databricks.
Overview of Spark Structured Streaming
Steps for Incremental Data Processing on Databricks
Configure Databricks Cluster with Instance Profile
Upload GHArchive Files to AWS S3 using Databricks Notebooks
Read JSON Data using Spark Structured Streaming on Databricks
Write using Delta file format using Trigger Once on Databricks
Analyze GHArchive Data in Delta files using Spark on Databricks
Add New GHActivity JSON files on Databricks
Load Data Incrementally to Target Table on Databricks
Validate Incremental Load on Databricks
Internals of Spark Structured Streaming File Processing on Databricks
Incremental Loads using Auto Loader cloudFiles on Databricks
As part of this section, we will see how to perform incremental loads using Auto Loader cloudFiles on Databricks Clusters.
Overview of Auto Loader cloudFiles on Databricks
Upload GHArchive Files to S3 on Databricks
Write Data using Auto Loader cloudFiles on Databricks
Add New GHActivity JSON files on Databricks
Load Data Incrementally to Target Table on Databricks
Add New GHActivity JSON files on Databricks
Overview of Handling S3 Events using AWS Services on Databricks
Configure IAM Role for cloudFiles file notifications on Databricks
Incremental Load using cloudFiles File Notifications on Databricks
Review AWS Services for cloudFiles Event Notifications on Databricks
Review Metadata Generated for cloudFiles Checkpointing on Databricks
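The two file discovery modes covered in this section differ only in the reader options passed to the `cloudFiles` source, which the following sketch tries to make concrete. It is a minimal sketch, not the course's implementation: the helper names, paths, and the use of `cloudFiles.schemaLocation` for schema inference are assumptions, and `spark` is assumed to be a SparkSession on a Databricks cluster.

```python
from typing import Dict


def auto_loader_options(file_format: str, schema_location: str,
                        use_notifications: bool = False) -> Dict[str, str]:
    """Build reader options for an Auto Loader (cloudFiles) stream."""
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
        # "false" selects directory listing, "true" selects file notifications.
        "cloudFiles.useNotifications": str(use_notifications).lower(),
    }


def load_incrementally(spark, source_path: str, target_path: str,
                       checkpoint_path: str, options: Dict[str, str]):
    """Process only files not yet recorded in the checkpoint, then stop."""
    stream = (
        spark.readStream.format("cloudFiles")
        .options(**options)
        .load(source_path)
    )
    return (
        stream.writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path)
        .trigger(once=True)  # run a single incremental batch
        .start(target_path)
    )
```

With `use_notifications=False` the stream lists the source directory to find new files. With `use_notifications=True` it relies on cloud event notifications, which is why the section configures an IAM role and reviews the related AWS services.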
Overview of Databricks SQL Clusters
As part of this section, we will get an overview of Databricks SQL Clusters.
Overview of Databricks SQL Platform - Introduction
Run First Query using SQL Editor of Databricks SQL
Overview of Dashboards using Databricks SQL
Overview of Databricks SQL Data Explorer to review Metastore Databases and Tables
Use Databricks SQL Editor to develop scripts or queries
Review Metadata of Tables using Databricks SQL Platform
Overview of loading data into retail_db tables
Configure Databricks CLI to push data into the Databricks Platform
Copy JSON Data into DBFS using Databricks CLI
Analyze JSON Data using Spark APIs
Analyze Delta Table Schemas using Spark APIs
Load Data from Spark Data Frames into Delta Tables
Run Adhoc Queries using Databricks SQL Editor to validate data
Overview of External Tables using Databricks SQL
Using COPY Command to Copy Data into Delta Tables
Manage Databricks SQL Endpoints