Overview
Explore a comprehensive 54-minute conference talk from Databricks on building a centralized platform for data quality management. Learn how Zillow tackled the challenge of ensuring data quality across thousands of datasets and pipelines. Discover the five pillars of their data quality platform and its architecture. Gain insights into self-service onboarding processes, including data discovery, rule-based approaches, and monitoring. Understand how validation libraries and pipeline integration work to flag data quality issues early. Examine the platform's capabilities in defining and viewing data quality expectations, performing validations using Spark, and dynamically generating pipelines. See how data quality metrics are exposed alongside datasets to provide a comprehensive health picture over time. Conclude with future directions and key takeaways for implementing a robust data quality management system in complex data organizations.
Syllabus
Intro
About Zillow
Why Monitor Data Quality?
Challenges We Faced
5 Pillars for Data Quality Platform
Platform Architecture
Self-Service Onboarding - Goals
Self-Service Onboarding - Data Discovery
Self-Service Onboarding - Rule-based
Self-Service Onboarding - Example
Self-Service Onboarding - Metrics
Self-Service Onboarding - Monitoring
Behind the Scenes
Validation Libraries
Pipeline Integration (before)
Pipeline Integration (after)
Validation Results
Future Direction
Key Takeaways
Taught by
Databricks