Designing and Implementing a Real-time Data Lake with Dynamically Changing Schema
Databricks via YouTube
Overview
Explore the design and implementation of a real-time data lake capable of handling dynamically changing schemas in this 25-minute presentation from Databricks. Learn how to build a robust streaming ETL pipeline that adapts to changing schemas and new event types without downtime. Discover techniques for inferring schemas on the fly, tracking and storing schemas without a schema registry, and adjusting underlying tables automatically. Gain insights into deploying and operating hundreds of streams on Databricks, and understand the cost and performance implications of growing ingestion loads from data providers. Dive into key topics such as schema variation hashing, batch processing, schema repository management, and essential takeaways for implementing this approach in production environments.
Syllabus
Intro
SEGA
Key Requirements
Sample Data
Schema Changes
Schema Variation Hash
Foreach Batch
Update the Schema
Schema Repository
Retrieve the Schema
Management Stream
Key Takeaways
Taught by
Databricks