Comparing the Different Ways to Scale Python and Pandas Code

Overview

Explore different approaches to scaling Python and Pandas code in this PyCon US talk. Learn about Fugue, an open-source unified interface for Pandas, Spark, and Dask that enables scale-agnostic compute workflows. Discover how to decouple logic from execution, so that you can write code in familiar languages like Python, Pandas, or SQL and then choose your preferred execution engine. Dive into the transform() function, which distributes the execution of a single function. Understand the limitations of Pandas, survey distributed computing frameworks, and see how Fugue lowers the barrier to entry for distributed computing. Compare eager and lazy evaluation, examine expectations versus reality in data processing, and explore a Spark solution based on traditional SQL syntax. Gain insight into how combining Python and SQL affects code size and execution time in large-scale data processing tasks.
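The idea of decoupling logic from execution can be illustrated with a minimal Pandas-only sketch. It is not Fugue's API: `add_share` and `run_partitioned` are hypothetical names invented for this example, and the runner stands in for whatever engine actually distributes the work. The point is only that the business logic is an ordinary function over a Pandas DataFrame, while the decision about how to split the data and where to run each piece lives elsewhere.

```python
import pandas as pd


def add_share(df: pd.DataFrame) -> pd.DataFrame:
    """Business logic in plain Pandas: each row's share of its partition total."""
    out = df.copy()
    out["share"] = out["value"] / out["value"].sum()
    return out


def run_partitioned(df: pd.DataFrame, fn, by: str) -> pd.DataFrame:
    """Hypothetical local runner (an assumption, not a Fugue function).

    Splits the data by a key, applies fn to each partition independently,
    and concatenates the results. A distributed engine would do the same
    thing, but with the partitions living on different machines.
    """
    parts = [fn(part) for _, part in df.groupby(by, sort=True)]
    return pd.concat(parts, ignore_index=True)


data = pd.DataFrame(
    {"key": ["a", "a", "b", "b"], "value": [1.0, 3.0, 2.0, 2.0]}
)
result = run_partitioned(data, add_share, by="key")
print(result)
```

Because `add_share` knows nothing about partitioning or engines, it can be unit tested on a small DataFrame and reused unchanged when the execution layer is swapped out, which is the role the overview attributes to transform().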

Syllabus

Intro
Pandas Limitations
How To Scale Out?
Distributed Computing Frameworks
Reducing Barrier to Entry
Introduction to Fugue
Fugue Transform
Bringing it to Spark
The DataFrame For Tests
Pandas Assumes Data Is Physically Together
Pandas Assumes Data Shuffle is Cheap
Pandas Assumes Eager Evaluation
Eager vs Lazy Evaluation
Expectation vs Reality
A Spark Solution Based On Traditional SQL Syntax
Fugue SQL
Leveraging Python
SQL Code Size & Execution Time
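
The eager versus lazy distinction listed in the syllabus can be sketched in plain Python, without Pandas or Spark. This is only an analogy under stated assumptions: a list comprehension stands in for eager evaluation (the model Pandas follows, where each operation runs immediately), and a generator stands in for lazy evaluation (the model Spark follows, where work is deferred until a result is requested). The names `double` and `calls` are made up for the illustration.

```python
calls = []  # records every time the function actually runs


def double(x):
    calls.append(x)
    return x * 2


values = [1, 2, 3, 4]

# Eager: all four calls happen as soon as this line executes.
eager = [double(v) for v in values]
eager_calls = len(calls)

calls.clear()

# Lazy: building the generator runs nothing yet.
lazy = (double(v) for v in values)
lazy_calls_before = len(calls)

# Work happens only when results are requested, and only as much as needed.
first_two = [next(lazy), next(lazy)]
lazy_calls_after = len(calls)

print(eager_calls, lazy_calls_before, lazy_calls_after)
```

The lazy version does no work for the two values that are never requested, which is why code written with eager assumptions can behave differently, and sometimes run work more than once, when moved to a lazy engine.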