Overview
Dive into the intricacies of the Parquet format and explore performance optimization opportunities in this 41-minute conference talk by Boudewijn Braams from Databricks. Begin with an introduction to structured data formats and physical data storage models, including row-wise, columnar, and hybrid approaches. Delve deeper into the specifics of the Parquet format, examining its disk representation, physical data organization, and encoding schemes. Learn about various performance optimization techniques such as dictionary encoding, page compression, predicate pushdown, dictionary filtering, and partitioning schemes. Discover strategies to combat the issue of 'many small files' and gain insights into the open-source Delta Lake format in relation to Parquet. Suitable for both newcomers seeking an approachable refresher on columnar storage and experienced professionals looking to optimize analytical workloads in Spark, this talk provides tangible tips and tricks to leverage the Parquet format for improved performance.
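To make two of the techniques named above concrete, here is a minimal pure-Python sketch of dictionary encoding and dictionary filtering. It is an illustration of the idea only, not Parquet's actual on-disk encoding or any library's API; the function names and the sample column are assumptions made up for this example.

```python
def dictionary_encode(values):
    """Replace each value with an integer index into a list of distinct values."""
    dictionary = []   # distinct values, in first-seen order
    index_of = {}     # value -> its position in `dictionary`
    indices = []      # one small integer per original value
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indices.append(index_of[value])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Rebuild the original column from the dictionary and the indices."""
    return [dictionary[i] for i in indices]


def chunk_may_contain(dictionary, wanted):
    """Dictionary filtering: a value missing from the dictionary cannot
    appear in the chunk, so the whole chunk can be skipped."""
    return wanted in dictionary


# Hypothetical low-cardinality column (illustrative data only).
column = ["NL", "US", "NL", "NL", "DE", "US"]
dictionary, indices = dictionary_encode(column)
```

The sketch works best on columns with few distinct values: the repeated strings are stored once and each row keeps only a small integer, while the dictionary doubles as a cheap check for whether a filter value can occur at all.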
Syllabus
Intro
Data processing and analytics
Overview
Data sources and formats
Physical storage layout models
Different workloads
Row-wise vs Columnar
Parquet: data organization
Parquet: encoding schemes
Optimization: dictionary encoding
Optimization: predicate pushdown
Optimization: partitioning (embed predicates in directory structure)
Optimization: avoid many small files
Optimization: avoid few huge files
Optimization: Delta Lake (open-source storage layer on top of Parquet in Spark)
Conclusion
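The partitioning item in the syllabus refers to embedding predicates in the directory structure. A rough sketch of that idea follows, assuming Hive-style `key=value` directory names; the file paths and helper names are hypothetical, and real engines such as Spark do this pruning internally rather than through functions like these.

```python
def partition_values(path):
    """Parse key=value directory names out of a file path."""
    values = {}
    for segment in path.split("/")[:-1]:  # the last segment is the file name
        if "=" in segment:
            key, value = segment.split("=", 1)
            values[key] = value
    return values


def prune(paths, **equals):
    """Keep only files whose partition directories satisfy every equality predicate."""
    return [
        path
        for path in paths
        if all(partition_values(path).get(k) == v for k, v in equals.items())
    ]


# Hypothetical dataset layout (illustrative paths only).
files = [
    "events/date=2020-01-01/country=NL/part-0.parquet",
    "events/date=2020-01-01/country=US/part-0.parquet",
    "events/date=2020-01-02/country=NL/part-0.parquet",
]
selected = prune(files, country="NL")
```

Because the predicate is answered from the path alone, files in non-matching directories are never opened. The same layout also shows how the "many small files" problem can arise: each extra partition column multiplies the number of directories and files.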
Taught by
Databricks