Using Pandas and Dask to Work with Large Columnar Datasets in Apache Parquet

Overview

Explore efficient techniques for handling large columnar datasets in Apache Parquet using Pandas and Dask in this EuroPython 2018 conference talk. Dive into the Apache Parquet data format, understanding its binary and columnar structure, as well as its CPU and I/O optimization techniques. Learn how to leverage row groups, compression, and dictionary encoding to enhance data storage and retrieval. Discover methods for reading Parquet files into Pandas DataFrames using the fastparquet and Apache Arrow (pyarrow) libraries. Gain insights into working with data larger than memory or local disk space using Apache Dask, including partitioning and cloud object storage systems like Amazon S3 and Azure Storage. Master techniques such as metadata utilization, partition filenames, column statistics, and dictionary filtering to boost query performance on extensive datasets. Understand the benefits of partitioning, row group skipping, and optimal data layout for accelerating queries on large-scale data.
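To make row group skipping and the role of data layout more concrete, here is a minimal pure-Python sketch of the idea. It is not taken from the talk and does not use a real Parquet library: `RowGroup`, `write_row_groups`, and `read_greater_than` are hypothetical names, and the min/max pair stands in for the per-column statistics that Parquet keeps in each row group's metadata.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """Hypothetical stand-in for a Parquet row group: values plus min/max statistics."""
    values: List[int]
    min_value: int
    max_value: int


def write_row_groups(values: List[int], rows_per_group: int) -> List[RowGroup]:
    """Split a column into fixed-size row groups and record statistics for each."""
    groups = []
    for start in range(0, len(values), rows_per_group):
        chunk = values[start:start + rows_per_group]
        groups.append(RowGroup(chunk, min(chunk), max(chunk)))
    return groups


def read_greater_than(groups: List[RowGroup], threshold: int) -> Tuple[List[int], int]:
    """Return the values above `threshold` and the number of row groups actually scanned."""
    matches: List[int] = []
    scanned = 0
    for group in groups:
        # The statistics prove that no value in this group can match, so skip it
        # without touching its data.
        if group.max_value <= threshold:
            continue
        scanned += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, scanned


# Sorted layout: each row group covers a narrow, non-overlapping value range.
sorted_groups = write_row_groups(list(range(100)), rows_per_group=10)
sorted_matches, sorted_scanned = read_greater_than(sorted_groups, threshold=94)

# Scattered layout: the same values in a permuted order, so ranges overlap.
scattered_values = [(i * 37) % 100 for i in range(100)]
scattered_groups = write_row_groups(scattered_values, rows_per_group=10)
scattered_matches, scattered_scanned = read_greater_than(scattered_groups, threshold=94)

print(sorted_scanned, scattered_scanned)  # 1 5
```

Both layouts return the same five values, but the sorted layout scans one row group out of ten while the scattered layout scans five. This is the sense in which data layout affects how much work the statistics can save.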

Syllabus

Intro
Outline
Business Model
Data Flow
Conclusion
Why do I care
Other technologies
Blob storage
Data sharing
Pocky
Why Parquet
Python implementations
Parquet file structure
Predicate pushdown
Dictionary encoding
Compression
Partitioning
Storage
ODBC
Azure Blob Storage
Questions

Taught by

EuroPython Conference
