Building Data Quality Pipelines with Apache Spark and Delta Lake

Databricks via YouTube

Overview

In this fast-paced 27-minute presentation, Databricks Technical Leads and Champions Darren Fuller and Sandy May explain how they productionize data quality pipelines for enterprise customers. Their vision is to let the business decide on data remediation actions and to enable self-healing data pipelines, supported by a library of data quality rule templates, a reporting data model, and Power BI reports. The Lakehouse pattern places data quality at the lake layer, using tools like Delta Lake for schema protection and column checking. Quick-fire demos show how Apache Spark can apply rules to data at staging or curation points, ranging from simple to complex: net sales calculations, value validations, statistical distribution validations, and complex pattern matching. The talk closes with a glimpse of future work on data compliance for PII data, including rule generation using regex patterns and machine learning-based transfer learning.
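
To make the idea of a rule template library more concrete, here is a minimal sketch of three of the rule kinds mentioned above (a net sales calculation, a value validation, and a regex pattern match). It is not the presenters' implementation. It uses plain Python instead of Apache Spark so that it runs without a cluster, and every name in it (Rule, evaluate, the column names, the order ID format) is a hypothetical chosen for illustration. The net sales formula (gross sales minus discounts) is likewise an assumption.

```python
import re
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, List

Row = Dict[str, Any]


@dataclass(frozen=True)
class Rule:
    """A named data quality check applied to one row at a time."""
    name: str
    check: Callable[[Row], bool]


def net_sales_rule(tolerance: float = 0.01) -> Rule:
    # Assumed definition: net_sales should equal gross_sales minus discounts.
    def check(row: Row) -> bool:
        expected = row["gross_sales"] - row["discounts"]
        return abs(row["net_sales"] - expected) <= tolerance

    return Rule("net_sales_consistent", check)


def allowed_values_rule(column: str, allowed: Iterable[str]) -> Rule:
    # Value validation: the column must hold one of a fixed set of values.
    allowed_set = frozenset(allowed)
    return Rule(f"{column}_allowed", lambda row: row[column] in allowed_set)


def pattern_rule(column: str, pattern: str) -> Rule:
    # Pattern matching: the whole value must match the regular expression.
    compiled = re.compile(pattern)
    return Rule(
        f"{column}_pattern",
        lambda row: compiled.fullmatch(str(row[column])) is not None,
    )


def evaluate(rows: List[Row], rules: List[Rule]) -> Dict[str, List[int]]:
    """Return, for each rule, the indexes of the rows that failed it."""
    return {
        rule.name: [i for i, row in enumerate(rows) if not rule.check(row)]
        for rule in rules
    }


# Hypothetical sample data: row 1 has inconsistent net sales,
# row 2 has an unknown currency and a malformed order ID.
rows = [
    {"gross_sales": 100.0, "discounts": 10.0, "net_sales": 90.0,
     "currency": "GBP", "order_id": "ORD-0001"},
    {"gross_sales": 50.0, "discounts": 5.0, "net_sales": 50.0,
     "currency": "GBP", "order_id": "ORD-0002"},
    {"gross_sales": 20.0, "discounts": 0.0, "net_sales": 20.0,
     "currency": "XXX", "order_id": "bad id"},
]
rules = [
    net_sales_rule(),
    allowed_values_rule("currency", ["GBP", "USD", "EUR"]),
    pattern_rule("order_id", r"ORD-\d{4}"),
]
report = evaluate(rows, rules)
print(report)
```

The per-rule list of failing rows is a stand-in for the kind of result a reporting data model could store. In a Spark pipeline, each rule would more likely be expressed as a column expression evaluated over a DataFrame at the staging or curation point, rather than as a Python function called row by row.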

Syllabus

Intro
Problem Statement
Dirty Data
Build or Buy
Design Decisions
Microsoft Enterprise Data Warehouse
Demo
Summary

Taught by

Databricks
