Overview
Explore how Adobe processes 2 million records per second using Apache Spark in this 41-minute Databricks conference talk. Dive into the challenges and solutions of Adobe's Unified Profile System, which ingests terabytes of data daily. Learn about optimizing repeated queries, understanding join operations, monitoring structured streaming lag, handling data skew, effective sampling techniques, and leveraging Redis for enhanced performance. Gain valuable insights from Adobe's experiences in scaling their Apache Spark deployment, including practical tips on caching physical plans, managing shuffles, dealing with backpressure, and making code resilient to skewed datasets. Benefit from real-world war stories and lessons that can be applied to large-scale data processing challenges in your own projects.
Syllabus
Intro
What do you mean by Processing? Agenda!
Unified Profile Data Ingestion
Generic Flow
Flow with MinPartitions partitions on Kafka
MicroBatch Hard! Logic Best Practices
An Example
For Repeated Queries Over Same DF
Join Optimization For Interactive Queries (Opinionated)
How to get the magic targetPartitionCount?
Digging into Redis Pipelining + Spark
Taught by
Databricks