Fully Utilizing Spark for Data Validation with Fugue and Pandera

Overview

Explore data validation techniques for large-scale data pipelines in this 22-minute Databricks conference talk. Learn about the importance of data validation in interconnected data pipelines and compare popular frameworks like Great Expectations with lightweight alternatives. Discover how to extend Pandas-based validation libraries to Spark workflows using Fugue, an open-source framework. Gain insights into applying different validation rules for each partition in big data scenarios, addressing a common deficiency in current frameworks. Follow along with an interactive demo that combines Fugue and Pandera to create a flexible and efficient data validation solution for Spark. Understand the trade-offs between robust features and performance, and learn how to tailor your validation approach to your specific needs.
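The idea of validating each partition with its own rule can be sketched without any framework. The example below is a minimal, library-free illustration in plain Python, not the code from the talk: the talk pairs Pandera schemas with Fugue so the same logic runs per partition on Spark. The rule table `PRICE_RULES`, the column names, the helper functions, and the sample rows are all hypothetical.

```python
from collections import defaultdict

# Hypothetical per-partition rules: allowed price range for each location.
PRICE_RULES = {
    "FL": (10.0, 20.0),
    "CA": (12.0, 25.0),
}


def validate_partition(location, rows):
    """Return the rows of one partition whose price falls outside that partition's range."""
    low, high = PRICE_RULES[location]
    return [row for row in rows if not (low <= row["price"] <= high)]


def validate_by_partition(rows, key="location"):
    """Group rows by the partition key, then apply the rule that belongs to each group."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    return {k: validate_partition(k, part) for k, part in partitions.items()}


# Made-up sample data, for illustration only.
data = [
    {"location": "FL", "item": "burger", "price": 11.0},
    {"location": "FL", "item": "pizza", "price": 22.0},
    {"location": "CA", "item": "burger", "price": 22.0},
    {"location": "CA", "item": "pizza", "price": 30.0},
]

# Maps each location to the list of rows that failed its own rule.
failures = validate_by_partition(data)
```

Note that a price of 22.0 fails in the "FL" partition but passes in "CA", which a single global range check could not express. In a distributed setting the grouping step would be handled by the execution engine, and each partition's check would typically be a schema from a validation library rather than a hand-written range test.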

Syllabus

Intro
Case Study
Data Validation
Common Validations
Great Expectations - Detailed Results
Great Expectations - Data Documentation
Pandera - Sample Code
Comparison of Validation Frameworks
Fugue - Basic Code
Combining Fugue and Pandera
Example Data - Food Sloth's Pricing
Validation by Partition

Taught by

Databricks

