Apache Spark Beyond Shuffling - Why it isn't Magic - but also where there is some really cool Magic

Overview

This 42-minute conference talk from GOTO Chicago 2017 covers Apache Spark's performance and scaling behavior. Holden Karau, a software development engineer at IBM's Spark Technology Center, explains why Spark is not "magic" and where it does offer something close to it. The talk introduces Spark's APIs across several programming languages and the RDD (Resilient Distributed Dataset) abstraction, then examines common problems such as key skew and data explosion. It walks through optimizing the word count example, handling large datasets, and using DataFrames. It closes with testing libraries, corporate compliance, pointers to further Spark resources, and a Q&A session.

Syllabus

Intro
Holden's background
Trans and clear
IBM
People
What is Spark
Why people come to Spark
The magic of Spark
What is an RDD
RDD Example
Word Count Example
Important Note
Example
Problems with Spark
Tokenizing the data
Magic
Key Skew
Data with Key Skew
Explosion
San Francisco
Hack
Grouping by Key
Bad Word Count
Data Size
Input
Reduce by Key
Data Frames
Fuzzy Pandas
Python
Driver
How does this break
Use data sets
Spark videos
Testing libraries
Spark books
Corporate compliance
Office hours
Questions

Taught by

GOTO Conferences
