Apache Arrow and the Future of Data Frames with Wes McKinney
Association for Computing Machinery (ACM) via YouTube
Overview
Syllabus
Apache Arrow and the Future of Data Frames
Career theme: Programming interfaces for data preparation, analytics, and feature engineering
What exactly is a data frame?
A data frame is ... a programming interface ... for expressing data manipulations
Data frames address many analytical workloads that are either not expressible in, or not well served by, traditional SQL-based systems
In R, the "data frame" data structure is part of the language Other projects implement their own (e.g. pandas) • Some projects may not use any data structures (e.g. compiling operations to SQL)
Most data frames are effectively "islands" with a hard serialization barrier • Many non-reusable implementations of the same algorithms • Limited collaboration across projects and programming languages
Apache Arrow: Open source community project launched in 2016 • Intersection of database systems, big data, and data science tools • Purpose: language-independent open standards and libraries to accelerate and simplify in-memory computing
Improve interoperability with other data processing systems • Standardize the data structures used in data frame implementations • Promote collaboration and code reuse across libraries and programming languages
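To make the interoperability point concrete, here is a minimal sketch (not from the talk itself) using the pyarrow library: a pandas DataFrame is converted to an Arrow table and back, so the same columnar data can be handed to any Arrow-aware library or language without a bespoke serialization step. The column names and values are made up for illustration.

```python
import pandas as pd
import pyarrow as pa

# A pandas DataFrame with made-up example data.
df = pd.DataFrame({"city": ["NYC", "SF", "LA"], "temp_c": [21.5, 17.0, 24.3]})

# Convert to an Arrow table: the standardized columnar representation
# that other Arrow-aware libraries and languages can consume.
table = pa.Table.from_pandas(df)

# Convert back to pandas only when a pandas-specific API is needed.
df_roundtrip = table.to_pandas()
```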
Limited data types • Excessive memory consumption • Poor processing efficiency for non-numeric types • Accommodating larger-than-memory datasets
Apache Arrow Project Overview: Language-agnostic in-memory columnar format for analytical query engines and data frames • Binary protocol for IPC/RPC • "Batteries included" development platform for building data processing applications
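As an illustration of the binary IPC protocol (a sketch based on the documented pyarrow API, with made-up data, not code from the talk), a table can be serialized to the Arrow IPC stream format in memory and read back without any row-by-row conversion; the same bytes are readable by other Arrow implementations.

```python
import pyarrow as pa

table = pa.table({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})  # made-up data

# Serialize the table to the Arrow IPC stream format in memory.
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
buf = sink.getvalue()

# The same bytes can be read back by any Arrow implementation
# (C++, Java, Rust, JavaScript, ...) with no deserialization into rows.
with pa.ipc.open_stream(buf) as reader:
    roundtrip = reader.read_all()

assert roundtrip.equals(table)
```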
Arrow and the Future of Data Frames: As more data sources offer Arrow-based data access, it will make sense to process Arrow in situ rather than converting to some other data structure • Analytical systems will generally grow more efficient the more "Arrow-native" they become
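A small sketch of what "processing Arrow in situ" can look like, using pyarrow's compute module on made-up data (an illustrative example, not taken from the talk): the filter and aggregation run directly on Arrow columnar memory, with no conversion to pandas or any other structure.

```python
import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({"city": ["NYC", "SF", "LA"], "temp_c": [21.5, 17.0, 24.3]})

# Filter and aggregate directly on the Arrow columnar data.
warm = table.filter(pc.greater(table["temp_c"], 20.0))
mean_temp = pc.mean(warm["temp_c"]).as_py()
```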
Runtime memory format for analytical query processing • Ideal companion to columnar storage like Apache Parquet • Fully shredded columnar format, supporting flat and nested schemas • Organized for cache-efficient access on CPUs/GPUs • Optimized for data locality, SIMD, and parallel processing • Accommodates both random access and scan workloads
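The "companion to columnar storage" relationship can be sketched with pyarrow's Parquet module (an illustrative example with a hypothetical file name and made-up data, not code from the talk): Parquet serves as the compressed columnar format on disk, while Arrow is the in-memory columnar format used for processing, and column pruning on read suits scan-style workloads.

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": [1, 2, 3], "score": [0.9, 0.7, 0.8]})  # made-up data

# Write the in-memory Arrow table to compressed columnar storage on disk.
pq.write_table(table, "example_scores.parquet")

# Read back only the columns needed, directly into Arrow memory.
scores = pq.read_table("example_scores.parquet", columns=["score"])
```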
Taught by
Association for Computing Machinery (ACM)