Overview
Explore architecting for data quality in the lakehouse with Delta Lake and PySpark in this comprehensive tech talk. Learn how to combat data downtime by adopting DevOps and software engineering best practices. Discover techniques for identifying, resolving, and preventing data issues across the data lakehouse. Gain insights into optimizing data reliability across metadata, storage, and query engine tiers. Build your own data observability monitors using PySpark and understand the role of tools like Delta Lake in scaling this design. Dive into topics such as the Data Quality Cone of Anxiety, data observability principles, and the Data Reliability Lifecycle. Examine the differences between data lakes and warehouses, and explore practical examples of measuring update times, loading data, and feature engineering. Access accompanying exercises and Jupyter notebooks to apply your newfound knowledge in real-world scenarios.
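The hands-on portion shows how far you can get with PySpark and Delta Lake's own metadata before reaching for a dedicated tool. As a minimal sketch of that idea, the snippet below checks freshness (delay between updates) and volume for a single Delta table using DESCRIBE DETAIL; the table name analytics.orders, the 24-hour threshold, and the baseline row count are illustrative assumptions, not values from the talk.

```python
# Minimal freshness and volume monitor sketch for one Delta table.
# Assumptions (not from the talk): the table name, the alert thresholds,
# and that the Spark session timezone matches the local clock.
from datetime import datetime, timedelta

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("observability-monitor").getOrCreate()

TABLE = "analytics.orders"  # hypothetical table name

# DESCRIBE DETAIL returns table-level metadata (lastModified, numFiles,
# sizeInBytes, ...) without scanning the underlying data files.
detail = spark.sql(f"DESCRIBE DETAIL {TABLE}").collect()[0]
last_modified = detail["lastModified"]

# Freshness: alert if the table has not been written to recently.
delay = datetime.now() - last_modified
if delay > timedelta(hours=24):
    print(f"ALERT: {TABLE} last updated {delay} ago")

# Volume: compare the current row count against a stored baseline
# (the baseline here is a placeholder for whatever you persist per run).
row_count = spark.table(TABLE).count()
baseline_row_count = 1_000_000
if row_count < 0.5 * baseline_row_count:
    print(f"ALERT: {TABLE} row count {row_count} is well below "
          f"baseline {baseline_row_count}")
```

Scheduling a check like this per table and persisting each measurement gives you a history of update cadence and table size, which is the raw material for the freshness and volume monitors covered in the session.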
Syllabus
Intro
Welcome
Introductions
Agenda
Data Quality Cone of Anxiety
How do we address bad data?
What is data observability?
Freshness
Distribution
Volume
Schema
Data Lineage
Data Reliability Lifecycle
Lake vs Warehouse
Metadata
Storage
Query logs
Query engine
Questions
Describe Detail
Architecture for observability
Measuring update times
Loading data in CSV or JSON
Update cadence
Feature engineering
Lambda function
Delay between updates
Model Parameters
Training Labels
Questions and Answers
Summary
Upcoming events
Data Quality Fundamentals
Monte Carlo
Taught by
Databricks