Recent Parquet Improvements in Apache Spark - Vectorized Complex Types and Column Index Support

Overview

Explore recent improvements to Parquet read performance in Apache Spark in this 37-minute talk from Databricks. Learn about vectorized read support for complex types, which can yield a 10x or greater improvement when reading Parquet data with complex structures. Discover how Parquet column index support strengthens predicate pushdown, and how Spark uses it to filter data more efficiently. Gain insight into the differences between the vectorized and non-vectorized Parquet readers, understand why predicate pushdown matters for scan performance, and get a glimpse of future work aimed at further improving Parquet read performance in Spark. The talk also covers technical concepts such as Parquet schema conversion, complex type support, and column index filtering.
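
To make the idea of column index filtering concrete, here is a minimal sketch in plain Python. It is not Spark or Parquet code: the Page class and the build_pages and scan_equals helpers are hypothetical names invented for this illustration. It only assumes that a column index records a minimum and maximum value per data page, so a reader can skip any page whose range cannot contain the filter value.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Page:
    """One data page of a single column, plus the min/max a column index would record.

    Hypothetical stand-in for illustration; not a Parquet or Spark class.
    """
    min_value: int
    max_value: int
    values: List[int]


def build_pages(values: List[int], page_size: int) -> List[Page]:
    """Split a column into fixed-size pages and record per-page min/max statistics."""
    pages = []
    for start in range(0, len(values), page_size):
        chunk = values[start:start + page_size]
        pages.append(Page(min(chunk), max(chunk), chunk))
    return pages


def scan_equals(pages: List[Page], target: int) -> Tuple[List[int], int]:
    """Evaluate `column == target`, returning the matches and the number of pages read."""
    matches: List[int] = []
    pages_read = 0
    for page in pages:
        # The statistics prove this page cannot contain the target, so skip it unread.
        if target < page.min_value or target > page.max_value:
            continue
        pages_read += 1
        matches.extend(v for v in page.values if v == target)
    return matches, pages_read


# A sorted column of 1000 values stored as 10 pages of 100 values each.
column = list(range(1000))
pages = build_pages(column, page_size=100)
matches, pages_read = scan_equals(pages, 742)
print(matches, pages_read, len(pages))  # prints: [742] 1 10
```

The column in this sketch is sorted, so each page covers a narrow, non-overlapping range and nine of the ten pages are skipped. On unsorted data the page ranges overlap heavily and far fewer pages can be ruled out.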

Syllabus

Intro
Short Intro
Outline
Introduction on Apache Parquet
Parquet: Glossary
Parquet: Data Page
Background
Non-Vectorized Parquet Reader
Advantages of Vectorized Approach
High Level Idea
Parquet Schema Conversion
SPARK-34863: Complex type support
Complex Type - Performance
Perf: vectorized vs non-vectorized
Parquet Predicate Pushdown
Column Index Filtering
Column Index Support in Spark
Column Index - Performance
Future Work

Taught by

Databricks

