How Adobe Processes 2 Million Records Per Second Using Apache Spark

Overview

Explore how Adobe processes 2 million records per second using Apache Spark in this 41-minute Databricks conference talk. Dive into the challenges and solutions of Adobe's Unified Profile System, which ingests terabytes of data daily. Learn about optimizing repeated queries, understanding join operations, monitoring structured streaming lag, handling data skew, effective sampling techniques, and leveraging Redis for enhanced performance. Gain valuable insights from Adobe's experiences in scaling their Apache Spark deployment, including practical tips on caching physical plans, managing shuffles, dealing with backpressure, and making code resilient to skewed datasets. Benefit from real-world war stories and lessons that can be applied to large-scale data processing challenges in your own projects.

Syllabus

Intro
What do you mean by Processing? Agenda!
Unified Profile Data Ingestion
Generic Flow
Flow with MinPartitions partitions on Kafka
MicroBatch Hard! Logic Best Practices
An Example
For Repeated Queries Over Same DF
Join Optimization For Interactive Queries (Opinionated)
How to get the magic targetPartitionCount?
Digging into Redis Pipelining + Spark

Taught by

Databricks
