Beyond Shuffling - Scaling Apache Spark

Overview

Explore advanced techniques for scaling Apache Spark in this 43-minute conference talk from Scala Days Berlin 2016. Delve into best practices and code snippets for handling large datasets efficiently. Learn to leverage Spark counters for performance investigation, optimize key-value data operations, and replace groupByKey with memory-efficient alternatives. Discover effective caching and checkpointing strategies to reduce execution time. Gain insights on functional transformations using Spark Datasets, working in noisy cluster environments, and utilizing Spark SQL for improved performance. Master the art of validating Spark jobs with accumulators and explore additional testing resources to enhance your Spark development skills.

Syllabus

Intro
What is going to be covered
The different pieces of Spark
What is key skew and why do we care?
Well there is a bit of magic in the shuffle...
Iterator to Iterator transformations
Why is Spark SQL good for those things?
How much faster can it be?
How to avoid lineage explosions
Introducing Datasets
And functional style maps
Switching gears: Validating Spark jobs
Using an accumulator for validation
Validating records read matches our expectations
Additional Spark Testing Resources
Additional Spark Resources
Spark Videos

Taught by

Scala Days Conferences
