Apache Arrow and the Future of Data Frames with Wes McKinney

Association for Computing Machinery (ACM) via YouTube

Classroom Contents

  1. Apache Arrow and the Future of Data Frames
  2. Career Theme: Programming interfaces for data preparation, analytics, and feature engineering
  3. What exactly is a data frame?
  4. A data frame is ... a programming interface ... for expressing data manipulations
  5. Data frames address many analytical workloads that are either not possible or not well-served by traditional SQL-based systems
  6. In R, the "data frame" data structure is part of the language • Other projects implement their own (e.g. pandas) • Some projects may not use any data structures (e.g. compiling operations to SQL)
  7. Most data frames are effectively "Islands" with a hard serialization barrier • Many non-reusable implementations of the same algorithms • Limited collaboration across projects and programming languag…
  8. Apache Arrow • Open source community project launched in 2016 • Intersection of database systems, big data, and data science tools • Purpose: Language independent open standards and libraries to accele…
  9. Improve interoperability problems with other data processing systems • Standardize data structures used in data frame implementations • Promote collaboration and code reuse across libraries and progr…
  10. Limited data types • Excessive memory consumption • Poor processing efficiency for non-numeric types • Accommodate larger-than-memory datasets
  11. Apache Arrow Project Overview • Language-agnostic in-memory columnar format for analytical query engines, data frames • Binary protocol for IPC/RPC • "Batteries included" development platform for build…
  12. Arrow and the Future of Data Frames • As more data sources offer Arrow-based data access, it will make sense to process Arrow in situ rather than converting to some other data structure • Analytical …
  13. Runtime memory format for analytical query processing • Ideal companion to columnar storage like Apache Parquet • Fully shredded columnar, supports flat and nested schemas • Organized for cache-efficient…
