Learn how to make your pandas code quicker and more efficient. This course covers vectorization, common mistakes, pandas performance, saving memory, Numba, Cython, and more.
Syllabus
Introduction
- pandas and performance
- What you should know
- Working with the files on GitHub
- Why performance matters
- Setting goals
- Measuring performance
- Profiling
- Challenge: Identify bottleneck
- Solution: Identify bottleneck
- What is vectorization?
- Boolean indexing
- Understanding ufuncs
- Challenge: Selecting and manipulating data
- Solution: Selecting and manipulating data
- The limitations of appending
- The limitations of object dtype
- The limitations of row iteration
- Understanding the isin function
- Parsing time once
- Challenge: Query a DataFrame
- Solution: Query a DataFrame
- Using built-in functions
- Understanding eval and query
- Understanding the join function
- Challenge: Join and query
- Solution: Join and query
- Why is memory important?
- Measuring memory
- Loading parts of data
- Categorical data
- Challenge: Reducing memory
- Solution: Reducing memory
- Various formats and why not CSV
- Optimizing with SQL
- Optimizing with HDF5
- Challenge: Bike ride duration
- Solution: Bike ride duration
- What is Numba?
- Using Numba
- What is Cython?
- Writing Cython code
- Compiling Cython
- %%cython magic
- Challenge: Cython speedup
- Solution: Cython speedup
- Overview of alternative DataFrames
- Using Dask
- Using Vaex
- Challenge: Vaex vs. pandas
- Solution: Vaex vs. pandas
- Next steps
Taught by
Miki Tebeka