Recent Parquet Improvements in Apache Spark - Vectorized Complex Types and Column Index Support
Databricks via YouTube
Syllabus
Intro
Short Intro
Outline
Introduction to Apache Parquet
Parquet: Glossary
Parquet: Data Page
Background
Non-Vectorized Parquet Reader
Advantages of Vectorized Approach
High Level Idea
Parquet Schema Conversion
SPARK-34863: Complex type support
Complex Type - Performance
Perf: vectorized vs non-vectorized
Parquet Predicate Pushdown
Column Index Filtering
Column Index Support in Spark
Column Index - Performance
Future Work
Taught by
Databricks