How Apache Spark 3.0 and Delta Lake Enhance Data Lake Reliability

Overview

Explore how Apache Spark 3.0 and Delta Lake enhance data lake reliability in this 58-minute webinar from Databricks. Learn about Apache Spark's role in big data processing, the evolution of data lake architectures, and Delta Lake's capabilities for ensuring reliable data. Discover how unifying batch and streaming simplifies architectures. Dive into Spark 3.0's new features, including the Adaptive Query Execution (AQE) framework for improved query performance, Dynamic Partition Pruning for faster processing in star schema designs, and accelerator-aware scheduling for GPU optimization. Examine new Pandas UDF types and function APIs, as well as enhanced monitoring capabilities. Gain insights into Delta Lake 0.7.0, the Spark Catalyst Optimizer, and the Lakehouse paradigm. Understand how ACID transactions, schema enforcement, and time travel contribute to data reliability. Learn about data quality frameworks, improved SQL capabilities, and the integration of DataSourceV2 and the Catalog API.
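
To make the reliability ideas above concrete, here is a minimal pure-Python sketch of how an append-only transaction log over immutable files can give atomic writes, schema enforcement, and time travel. It is a toy model under simplifying assumptions, not the Delta Lake API or its on-disk format: `ToyTable`, `append`, `delete_where`, and `read` are hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Toy stand-in for a versioned table; NOT the Delta Lake API."""
    schema: frozenset                            # allowed column names
    files: dict = field(default_factory=dict)    # file name -> rows (never rewritten)
    log: list = field(default_factory=list)      # one entry per commit

    def _commit(self, add_rows=None, remove=()):
        # Write new data to a fresh file, then record the change in the log.
        add = []
        if add_rows is not None:
            name = f"part-{len(self.files):05d}"
            self.files[name] = list(add_rows)
            add.append(name)
        self.log.append({"add": add, "remove": list(remove)})
        return len(self.log) - 1                 # version number of this commit

    def append(self, rows):
        # Schema enforcement: validate every row before anything is committed,
        # so a bad batch leaves the table unchanged.
        for row in rows:
            if set(row) != self.schema:
                raise ValueError(f"schema mismatch: {sorted(row)}")
        return self._commit(add_rows=rows)

    def delete_where(self, predicate):
        # Deletes are additive: surviving rows go to a new file and the old
        # files are only marked as removed in the log.
        live = self._live_files(len(self.log) - 1)
        survivors = [r for f in live for r in self.files[f] if not predicate(r)]
        return self._commit(add_rows=survivors, remove=live)

    def _live_files(self, version):
        # Replay the log up to and including `version`.
        live = []
        for commit in self.log[: version + 1]:
            live = [f for f in live if f not in commit["remove"]]
            live.extend(commit["add"])
        return live

    def read(self, version=None):
        # Time travel: read the table as of any committed version.
        if version is None:
            version = len(self.log) - 1
        return [r for f in self._live_files(version) for r in self.files[f]]
```

Because old files are kept rather than overwritten, reading an earlier version only requires replaying a shorter prefix of the log. The same property is why a retention policy is needed in practice: removed files take up storage until they are cleaned up.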

Syllabus

Intro
Databricks
Deep Dive into the New Features of Apache Spark 3.0
A Delta Lake 0.7.0 + Spark 3.0 AMA
Spark Catalyst Optimizer
Adaptive Query Execution (AQE)
Apache Spark™ 3.0 AQE Fundamentals
Starting with Broadcast Hash Joins
Dynamically Switching Join Strategies: Apache Spark 3.0 AQE Fundamentals
Dynamically Coalescing Shuffle Partitions: Apache Spark 3.0 AQE Fundamentals
Dynamically Optimize Skew Joins
TPC-DS performance gains from AQE
Dynamic Partition Pruning: Before Optimization
How to Use Join Hints: Broadcast Hash Join
Extensibility and Ecosystem
Data Source V2
But what happens with DML under the covers? What really happens to the file system when you run DELETE, UPDATE, and MERGE?
Time Travel: The transaction log and additive files (data versioning)
Control Table History Retention
Enable DataSourceV2 and Catalog API Integration
Data Quality Framework: Improved SQL DDL and DML, and ACID transactions, are just the start
Lakehouse Paradigm: Improved performance and DW-like capabilities on low-cost cloud object stores
Try out Spark 3.0 + Delta Lake now!
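
For readers who want to try the features listed in the syllabus, the sketch below collects the configuration keys documented for Spark 3.0 (AQE, Dynamic Partition Pruning) and Delta Lake 0.7.0 (SQL extension and catalog integration) and renders them as `spark-submit` arguments. The helper `spark_submit_args` is a hypothetical name used only for illustration. Which settings are on by default varies by Spark version, so treat the values as a starting point and check them against your own distribution.

```python
# Maven coordinate for Delta Lake 0.7.0 built for Scala 2.12 (the Spark 3.0 default).
DELTA_PACKAGE = "io.delta:delta-core_2.12:0.7.0"

SETTINGS = {
    # Adaptive Query Execution and its sub-features.
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
    # Dynamic Partition Pruning.
    "spark.sql.optimizer.dynamicPartitionPruning.enabled": "true",
    # Delta Lake SQL support through DataSourceV2 and the Catalog API.
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
}


def spark_submit_args(settings, package):
    """Render settings as a list of spark-submit command-line arguments."""
    args = ["--packages", package]
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Print a command line that can be pasted after `spark-submit` or `pyspark`.
    print(" ".join(spark_submit_args(SETTINGS, DELTA_PACKAGE)))
```

The same key and value pairs can instead be passed one at a time to `SparkSession.builder.config(key, value)` when building a session in code.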