Overview
Syllabus
Intro
Agenda
Big Data Platform
Old vs New cluster
Old Cluster: Performance Bottleneck
A Simple Aggregation Query
9k Mappers * 9k Reducers
New Cluster: Choose the right EC2 instance
Key Takeaways
Read-after-write consistency
How often does this happen?
Solution Considerations
Our Approach
Performance Comparison: S3 vs HDFS
Dealing with Metadata Operations
Reduce Move Operations
Multipart Upload API
The Last Move Operation
Fix Bucket Rate Limit Issue (503)
Improving S3Committer
S3 Benefits Compared to HDFS
Things We Miss in Mesos
Cost Saving
Spark at Pinterest
Taught by
Databricks