How to Automate Performance Tuning for Apache Spark

Databricks via YouTube

Overview

Discover how to streamline and automate performance tuning for Apache Spark in this 41-minute conference talk by Jean-Yves Stephan of Data Mechanics. Learn about the challenges of keeping data pipelines efficient and stable in production, including selecting appropriate infrastructure, configuring Spark correctly, and ensuring scalability as data volumes grow. Explore the key metrics and parameters involved in manual tuning, and examine automation options ranging from open-source tools to managed services. Gain insight into common issues such as lack of parallelism, shuffle spill, and data skew, and see how node metrics can guide improvements. The talk covers the iterative nature of performance tuning, cost-speed trade-offs, and the architecture and algorithms behind automated tuning tools. By the end, you'll understand how to optimize your Spark applications and meet SLAs efficiently, even as you scale to hundreds or thousands of jobs.
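
To make the tuning knobs mentioned above more concrete, here is a minimal PySpark sketch (not taken from the talk; the configuration values, app name, and toy aggregation are illustrative assumptions) showing two settings that manual tuning commonly adjusts, plus a quick way to check partition balance for signs of data skew.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Build a session with two of the knobs the talk discusses: shuffle
# parallelism and executor memory. The values are placeholders, not
# recommendations from the talk.
spark = (
    SparkSession.builder
    .appName("manual-perf-tuning-sketch")
    .config("spark.sql.shuffle.partitions", "200")  # parallelism of shuffled stages
    .config("spark.executor.memory", "4g")          # undersized executors spill shuffle data to disk
    .getOrCreate()
)

# Toy aggregation whose shuffle behaviour depends on the settings above.
df = spark.range(0, 10_000_000).withColumn("key", F.col("id") % 100)
counts = df.groupBy("key").count()

# Inspect how evenly rows land across partitions after the shuffle;
# a few oversized partitions is the classic signature of data skew.
sizes = counts.rdd.glom().map(len).collect()
print(f"partitions: {len(sizes)}, max rows: {max(sizes)}, min rows: {min(sizes)}")

Iterating on such settings per job is exactly the manual loop the talk argues becomes unmanageable at hundreds or thousands of jobs, which motivates the automated tuning tools covered later in the syllabus.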

Syllabus

Intro
What is performance tuning?
Why automate performance tuning?
Perf tuning is an iterative process
Common issues: lack of parallelism
Common issues: shuffle spill
Improvements based on node metrics
Cost-speed trade-off
Recap: manual perf tuning
Open source tuning tools
Motivations
Architecture (tech)
Architecture (algo)
Heuristics example
Evaluator
Experiment manager
Data Mechanics platform
Common issues: data skew
Impact of automated tuning

Taught by

Databricks
