Designing the Next Generation of Data Pipelines with Apache Spark - Zillow's Approach
Databricks via YouTube
Overview
Explore how Zillow's data engineering team redesigned their data pipeline architecture using Apache Spark in this 27-minute conference talk. Learn about the challenges of balancing development speed with pipeline maintainability in a rapidly evolving organization. Discover how Zillow identified and addressed technical debt, improved data quality enforcement, consolidated shared pipeline functionality, and implemented complex business logic at scale. Gain insights into the process of designing a new end-to-end pipeline architecture that improves robustness, maintainability, and scalability while reducing code complexity. Understand the pain points in pipeline development, maintenance, and scaling, and explore the pros and cons of various ETL patterns. Delve into Zillow's approach to creating more scalable and robust data pipelines with Apache Spark, including the establishment of processing layers, the development of a Pipeler Library, config-driven orchestration, separation of data processing from business logic, and early data validation techniques.
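The talk does not publish Zillow's internal code, but the patterns it names (config-driven orchestration, separation of data processing from business logic, and early data validation) can be illustrated with a minimal PySpark sketch. The config keys, function names, and S3 paths below are illustrative assumptions, not the actual Pipeler Library API.

```python
# Minimal sketch of a config-driven Spark pipeline with early validation and
# business logic kept separate from generic data processing. All names and
# paths are hypothetical.
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

# Config-driven orchestration: inputs, validation rules, and outputs live in a
# declarative config rather than being hard-coded in the pipeline body.
PIPELINE_CONFIG = {
    "source_path": "s3://example-bucket/raw/home_listings/",
    "output_path": "s3://example-bucket/curated/home_listings/",
    "required_columns": ["listing_id", "zip_code", "list_price"],
    "non_null_columns": ["listing_id", "list_price"],
}


def validate_early(df: DataFrame, config: dict) -> DataFrame:
    """Fail fast on schema and null violations before any expensive transforms."""
    missing = set(config["required_columns"]) - set(df.columns)
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    for col in config["non_null_columns"]:
        null_count = df.filter(F.col(col).isNull()).count()
        if null_count > 0:
            raise ValueError(f"Column {col} has {null_count} null values")
    return df


def standardize(df: DataFrame) -> DataFrame:
    """Generic data processing: trimming, typing, deduplication -- no business rules."""
    return (
        df.withColumn("zip_code", F.trim(F.col("zip_code")))
        .withColumn("list_price", F.col("list_price").cast("double"))
        .dropDuplicates(["listing_id"])
    )


def apply_business_logic(df: DataFrame) -> DataFrame:
    """Business rules isolated so they can evolve without touching the plumbing."""
    return df.withColumn(
        "price_band",
        F.when(F.col("list_price") < 300_000, "entry")
        .when(F.col("list_price") < 750_000, "mid")
        .otherwise("premium"),
    )


def run_pipeline(spark: SparkSession, config: dict) -> None:
    raw = spark.read.parquet(config["source_path"])
    validated = validate_early(raw, config)
    processed = standardize(validated)
    final = apply_business_logic(processed)
    final.write.mode("overwrite").parquet(config["output_path"])


if __name__ == "__main__":
    spark = SparkSession.builder.appName("config-driven-pipeline").getOrCreate()
    run_pipeline(spark, PIPELINE_CONFIG)
```

Keeping the config, the validation step, the standardization step, and the business rules in separate units mirrors the layered structure the talk advocates: each piece can be tested, reused, and changed independently.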
Syllabus
Intro
What is Zillow Offers?
Original Architecture
New Architecture
Establish Processing Layers
Pipeler Library
Config-driven Orchestration
Data Processing vs. Business Logic
Validating Data Early
Taught by
Databricks