Downscaling Apache Spark Clusters - Challenges and Solutions

Overview

Explore the challenges and solutions for downscaling Apache Spark clusters in this 36-minute conference talk by Prakhar Jain from Databricks. Dive into the complexities of removing nodes from running Spark-on-YARN clusters when workload decreases, addressing issues like container fragmentation and shuffle data retention. Learn about approaches to improve downscaling, including changes to YARN's container allocation strategy and Spark's task scheduler for better container packing. Discover enhancements to the Spark driver and the External Shuffle Service (ESS) that enable proactive deletion of consumed shuffle data, so that nodes can be reclaimed faster. Gain insights into terminology, resource allocation strategies, and the impact of minimum executors on downscaling. Examine how shuffle data is produced and consumed, the role of the ESS, and potential solutions for long-running applications. Conclude with an overview of disaggregating compute and storage in Spark and future directions for cluster downscaling optimization.
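To make the container-packing idea concrete, here is a minimal Python sketch. It is not YARN's or Spark's actual allocation code: the function names, the uniform slot count per node, and the single-slot containers are all assumptions made for illustration. It compares a spreading policy with a packing policy and reports which nodes end up completely idle, and are therefore candidates for removal.

```python
from typing import Dict, List


def place_containers(node_names: List[str], slots_per_node: int,
                     num_containers: int, pack: bool) -> Dict[str, int]:
    """Place single-slot containers and return the free slots left per node.

    Hypothetical model: every node has the same number of slots and every
    container needs exactly one slot.
    """
    free = {name: slots_per_node for name in sorted(node_names)}
    for _ in range(num_containers):
        candidates = [n for n in free if free[n] > 0]
        if not candidates:
            raise RuntimeError("cluster is full")
        if pack:
            # Packing: fill the most utilised node first (fewest free slots).
            target = min(candidates, key=lambda n: free[n])
        else:
            # Spreading: use the least utilised node first (most free slots).
            target = max(candidates, key=lambda n: free[n])
        free[target] -= 1
    return free


def idle_nodes(free: Dict[str, int], slots_per_node: int) -> List[str]:
    """Nodes with no containers at all; only these can be removed safely."""
    return [n for n, f in free.items() if f == slots_per_node]


if __name__ == "__main__":
    nodes = ["n1", "n2", "n3", "n4"]
    for pack in (False, True):
        free = place_containers(nodes, slots_per_node=4, num_containers=6, pack=pack)
        print("pack" if pack else "spread", idle_nodes(free, 4))
```

With four nodes of four slots each and six containers, spreading leaves every node partly occupied, while packing leaves `n3` and `n4` empty. The same number of containers runs in both cases; only the placement decides whether any node can be released.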

Syllabus

Intro
Autoscaling on cloud
Upscale easy, downscale difficult
How are nodes used?
Factors affecting node downscaling
Terminology (the entities a cluster generally comprises, e.g. Resource Manager)
Current resource allocation strategy
Example revisited with new allocation strategy
Downscale issues with Min Executors
Min executors distribution without packing
Min executors distribution with packing
How is shuffle data produced and consumed?
External Shuffle Service
ESS at Qubole
Recap
Shuffle Cleanup (the ESS deletes shuffle data at the end of the application)
Issues with long running applications
Shuffle reuse in Spark
Downscaling a Node
Spark - Disaggregation of Compute and Storage (mount an NFS endpoint on all nodes of the cluster; change Spark's shuffle manager to one that can read and write shuffle data from the NFS mount point)
Summary and Future Work
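
The shuffle-related items in the syllabus can be illustrated with a simplified Python model of when a node becomes reclaimable. This is only a sketch under stated assumptions, not the ESS or Spark driver implementation: the `Node` class, its methods, and the per-reducer bookkeeping are hypothetical, and the model deliberately ignores shuffle reuse across jobs. A node is treated as removable once it runs no containers and all shuffle output it holds has been consumed and deleted.

```python
from typing import Dict, Iterable, Set


class Node:
    """Hypothetical view of one worker node as seen by a downscaling policy."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.running_containers = 0
        # shuffle_id -> reducers that have not yet read this node's output
        self.pending_readers: Dict[int, Set[int]] = {}

    def register_shuffle(self, shuffle_id: int, reducers: Iterable[int]) -> None:
        """Record that map output for `shuffle_id` is stored on this node."""
        self.pending_readers[shuffle_id] = set(reducers)

    def mark_consumed(self, shuffle_id: int, reducer: int) -> None:
        """A reducer has fetched its data; drop the shuffle once all have."""
        readers = self.pending_readers.get(shuffle_id)
        if readers is None:
            return
        readers.discard(reducer)
        if not readers:
            # Proactive cleanup: do not wait for the application to finish.
            del self.pending_readers[shuffle_id]

    def can_downscale(self) -> bool:
        """Removable only with no containers and no unconsumed shuffle data."""
        return self.running_containers == 0 and not self.pending_readers


if __name__ == "__main__":
    node = Node("n1")
    node.register_shuffle(shuffle_id=0, reducers=[0, 1])
    print(node.can_downscale())
    node.mark_consumed(0, 0)
    node.mark_consumed(0, 1)
    print(node.can_downscale())
```

If shuffle data were deleted only at the end of the application, `pending_readers` would stay non-empty for the whole lifetime of a long-running application and the node could never be released, which is the problem the proactive deletion targets.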

Taught by

Databricks
