Building Data Quality Pipelines with Apache Spark and Delta Lake

Overview

This fast-paced 27-minute presentation by Databricks Technical Leads and Champions Darren Fuller and Sandy May covers productionizing Data Quality Pipelines for enterprise customers. Their vision is to support business decisions on data remediation and to enable self-healing Data Pipelines through a library of Data Quality rule templates, a reporting data model, and Power BI reports. The talk explains how the Lakehouse pattern places Data Quality at the Lake layer, using tools such as Delta Lake for schema protection and column checking. Quick-fire demos show how Apache Spark can apply rules over data at Staging or Curation points, ranging from simple to complex: net sales calculations, value validations, statistical distribution validations, and complex pattern matching. The session closes with a glimpse of future work on Data Compliance for PII data, involving rule generation using regex patterns and Machine Learning-based transfer learning.
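To give a rough feel for what a library of rule templates might look like, here is a minimal sketch in plain Python. It is not the presenters' implementation and it does not use Spark; in a real pipeline each check would more likely be expressed as a Spark column expression over a Staging or Curation table. The column names (order_id, gross_sales, discounts, returns, net_sales, currency), the net sales formula (gross minus discounts minus returns), the allowed currency set, and the order ID pattern are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, List

Row = Dict[str, Any]


@dataclass(frozen=True)
class Rule:
    """A named data quality check that returns True when a row passes."""
    name: str
    check: Callable[[Row], bool]


def net_sales_rule(tolerance: float = 0.01) -> Rule:
    # Assumed formula: net_sales = gross_sales - discounts - returns.
    def check(row: Row) -> bool:
        parts = [row.get(c) for c in ("gross_sales", "discounts", "returns", "net_sales")]
        if any(p is None for p in parts):
            return False  # a missing input counts as a failure
        gross, discounts, returns, net = parts
        return abs(net - (gross - discounts - returns)) <= tolerance
    return Rule("net_sales_consistent", check)


def value_in_set_rule(column: str, allowed: Iterable[Any]) -> Rule:
    allowed_set = set(allowed)
    return Rule(f"{column}_in_allowed_set", lambda row: row.get(column) in allowed_set)


def pattern_rule(column: str, pattern: str) -> Rule:
    compiled = re.compile(pattern)

    def check(row: Row) -> bool:
        value = row.get(column)
        return isinstance(value, str) and compiled.fullmatch(value) is not None
    return Rule(f"{column}_matches_pattern", check)


def apply_rules(rows: List[Row], rules: List[Rule]) -> List[Dict[str, Any]]:
    """Evaluate every rule against every row and return one result record per pair."""
    return [
        {"row": index, "rule": rule.name, "passed": rule.check(row)}
        for index, row in enumerate(rows)
        for rule in rules
    ]


# Hypothetical sample data: the first row is clean, the second breaks all three rules.
rows = [
    {"order_id": "SO-0001", "gross_sales": 100.0, "discounts": 10.0,
     "returns": 5.0, "net_sales": 85.0, "currency": "GBP"},
    {"order_id": "1002", "gross_sales": 50.0, "discounts": 0.0,
     "returns": 0.0, "net_sales": 45.0, "currency": "XXX"},
]

rules = [
    net_sales_rule(),
    value_in_set_rule("currency", {"GBP", "USD", "EUR"}),
    pattern_rule("order_id", r"SO-\d{4}"),
]

results = apply_rules(rows, rules)
failures = [r for r in results if not r["passed"]]

for failure in failures:
    print(f"row {failure['row']} failed {failure['rule']}")
```

Keeping one result record per row and rule, rather than simply dropping bad rows, is one way to feed a reporting model of the kind the overview mentions, since pass and fail counts can then be aggregated per rule.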

Syllabus

Intro
Problem Statement
Dirty Data
Build or Buy
Design Decisions
Microsoft Enterprise Data Warehouse
Demo
Summary

Taught by

Databricks
