Overview
Explore techniques to accelerate data processing in this 30-minute EuroPython 2020 conference talk. Learn about common bottlenecks in data science workflows and how to overcome them using parallel and asynchronous programming with Python's concurrent.futures module. Discover the differences between sequential and parallel processing, synchronous and asynchronous execution, and when to apply these concepts in network I/O operations and computation-driven workloads. Gain practical insights into implementing parallelism and asynchronous programming to optimize data processing pipelines, allowing more focus on extracting value from data. Through real-life analogies, understand concepts like Amdahl's Law, multiprocessing vs multithreading, and practical implementations using ThreadPoolExecutor and ProcessPoolExecutor. Suitable for data scientists, engineers, and anyone with basic Python knowledge interested in improving data processing efficiency.
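As a rough illustration of the sequential versus parallel contrast described above, here is a minimal sketch built on concurrent.futures. It is not taken from the talk: the task functions fake_download and sum_of_squares and the helpers run_sequential and run_parallel are hypothetical names invented for this example, and the sleep only stands in for network latency.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def fake_download(item_id):
    """Hypothetical I/O-bound task: the sleep stands in for network latency."""
    time.sleep(0.05)
    return item_id * 2


def sum_of_squares(n):
    """Hypothetical CPU-bound task: pure-Python arithmetic."""
    return sum(i * i for i in range(n))


def run_sequential(func, inputs):
    """Process the inputs one after another."""
    return [func(x) for x in inputs]


def run_parallel(func, inputs, executor_cls=ThreadPoolExecutor, max_workers=4):
    """Process the inputs with a pool of workers.

    executor.map returns results in the same order as the inputs.
    """
    with executor_cls(max_workers=max_workers) as executor:
        return list(executor.map(func, inputs))


if __name__ == "__main__":
    ids = list(range(8))

    start = time.perf_counter()
    run_sequential(fake_download, ids)
    print(f"sequential I/O:  {time.perf_counter() - start:.2f}s")

    # Threads suit tasks that spend most of their time waiting.
    start = time.perf_counter()
    run_parallel(fake_download, ids)
    print(f"threaded I/O:    {time.perf_counter() - start:.2f}s")

    # Processes sidestep the GIL for computation-driven work.
    sizes = [200_000] * 8
    start = time.perf_counter()
    run_parallel(sum_of_squares, sizes, executor_cls=ProcessPoolExecutor)
    print(f"process-pool CPU: {time.perf_counter() - start:.2f}s")
```

Passing the executor class as a parameter keeps the calling code identical for both pools, which makes it easy to measure which one fits a given workload. Actual timings depend on the machine and on how much work each task does, so none are shown here.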
Syllabus
Intro
A typical data science workflow
Data Processing in Python
Challenges with Data Processing
Task 1: Toast 100 slices of bread
Sequential Processing
Parallel Processing
Task 2: Brew coffee
Synchronous Execution
Practical Considerations
Amdahl's Law and Parallelism
Multiprocessing vs Multithreading
Initialize Submission List
Using ThreadPoolExecutor
Initialize Python modules
Initialize image resize process
Initialize File List in Directory
Using List Comprehensions
Using ProcessPoolExecutor
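The "Amdahl's Law and Parallelism" item in the syllabus refers to the standard formula for the best-case speedup of a partly parallel task: speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel and n is the number of workers. The sketch below simply evaluates that formula; the function name amdahl_speedup and the sample fraction of 0.9 are assumptions chosen for illustration, not figures from the talk.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Theoretical speedup under Amdahl's Law.

    parallel_fraction: share of the total work that can be parallelized (0 to 1).
    workers: number of parallel workers (at least 1).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Illustrative assumption: 90% of the pipeline can run in parallel.
    for workers in (1, 2, 4, 8, 100):
        print(f"{workers:>3} workers: {amdahl_speedup(0.9, workers):.2f}x")
```

With p fixed, the speedup approaches 1 / (1 - p) as workers are added, which is why the serial part of a pipeline eventually dominates no matter how many threads or processes are used.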
Taught by
EuroPython Conference