Overview
Discover how to build scalable and optimized data analytics pipelines by combining the power of Apache Hadoop and Apache Spark.

Syllabus
Introduction
- The combined power of Spark and Hadoop Distributed File System (HDFS)
Hadoop and Spark Overview
- Apache Hadoop overview
- Apache Spark overview
- Integrating Hadoop and Spark
- Setting up the environment
- Using exercise files
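
The setup lessons above pair a local Spark installation with HDFS. As a minimal sketch of that wiring (the namenode URI, the app name, and the local[*] master are assumptions for a single-machine sandbox, not values from the course):

```python
from pyspark.sql import SparkSession

# Minimal local session; the HDFS namenode address is an assumption.
spark = (
    SparkSession.builder
    .appName("hadoop-spark-analytics")  # hypothetical app name
    .master("local[*]")                 # run Spark locally on all cores
    # Route unqualified paths to HDFS via Hadoop's fs.defaultFS setting.
    .config("spark.hadoop.fs.defaultFS", "hdfs://localhost:9000")
    .getOrCreate()
)
```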
Data Storage
- Storage formats
- Compression
- Partitioning
- Bucketing
- Best practices for data storage
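
A hedged sketch of how the partitioning and bucketing topics above look in PySpark; the DataFrame `df`, the column names, and the HDFS paths are illustrative placeholders:

```python
# Partitioned write: one subdirectory per distinct value of the column.
(df.write
   .mode("overwrite")
   .option("compression", "snappy")  # compress Parquet pages with Snappy
   .partitionBy("country")           # hypothetical partition column
   .parquet("hdfs://localhost:9000/warehouse/users"))

# Bucketed write: hash rows into a fixed number of files. Bucketing in
# Spark requires saving as a table rather than to a bare path.
(df.write
   .bucketBy(8, "user_id")           # hypothetical bucket column
   .sortBy("user_id")
   .mode("overwrite")
   .saveAsTable("users_bucketed", format="parquet"))
```

As a rule of thumb, partitioning suits low-cardinality columns that appear in filters, while bucketing suits high-cardinality columns that appear in joins.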
Data Ingestion
- Reading external files into Spark
- Writing to HDFS
- Parallel writes with partitioning
- Parallel writes with bucketing
- Best practices for ingestion
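
As a sketch of the ingestion flow above, assuming a local CSV file with a header row and a `subject` column (all paths and names here are placeholders):

```python
# Read an external CSV into a DataFrame.
raw = (
    spark.read
    .option("header", True)       # first line holds column names
    .option("inferSchema", True)  # let Spark guess column types
    .csv("file:///tmp/scores.csv")
)

# Write to HDFS in parallel; partitionBy splits the output by column
# value, so each subject lands in its own directory.
(raw.write
    .mode("overwrite")
    .partitionBy("subject")
    .parquet("hdfs://localhost:9000/data/scores"))
```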
Data Extraction
- How Spark works
- Reading HDFS files with schema
- Reading partitioned data
- Reading bucketed data
- Best practices for data extraction
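
For the extraction topics above, supplying an explicit schema skips Spark's inference pass, and filtering on a partition column lets it prune whole directories. A sketch under the same assumed layout as the ingestion example:

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Explicit schema: Spark reads no files just to infer column types.
schema = StructType([
    StructField("student", StringType()),
    StructField("score", IntegerType()),
    StructField("subject", StringType()),  # the partition column
])

df = (
    spark.read
    .schema(schema)
    .parquet("hdfs://localhost:9000/data/scores")
)

# Filtering on the partition column prunes whole directories at read time.
math_scores = df.where(df.subject == "math")
```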
Data Processing
- Pushing down projections
- Pushing down filters
- Managing partitions
- Managing shuffling
- Improving joins
- Storing intermediate results
- Best practices for data processing
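
The processing lessons above are about keeping work off the network: selecting only needed columns, filtering early, controlling shuffle width, broadcasting the small side of a join, and caching reused results. A hedged sketch (the `df` and `students` DataFrames and the shuffle-partition count are assumptions):

```python
from pyspark.sql import functions as F

# Shrink the shuffle width for a small dataset; Spark defaults to 200.
spark.conf.set("spark.sql.shuffle.partitions", "8")

# `students` is a hypothetical small lookup DataFrame keyed by student.
result = (
    df.select("student", "score")              # projection pushdown
      .where(F.col("score") > 50)              # filter pushdown
      .join(F.broadcast(students), "student")  # broadcast the small side
)

# Keep a reused intermediate result in memory across later stages.
result.cache()
result.count()  # an action, which materializes the cache
```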
Use Case
- Problem definition
- Data loading
- Total score analytics
- Average score analytics
- Top student analytics
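
As a sketch of what the three analytics in this use case might look like, assuming a `df` with student, subject, and score columns as in the earlier examples:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Total score per student.
totals = df.groupBy("student").agg(F.sum("score").alias("total_score"))

# Average score per subject.
averages = df.groupBy("subject").agg(F.avg("score").alias("avg_score"))

# Top student per subject via a window ranking.
by_score = Window.partitionBy("subject").orderBy(F.desc("score"))
top_students = (
    df.withColumn("rank", F.row_number().over(by_score))
      .where(F.col("rank") == 1)
      .drop("rank")
)
```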
Conclusion
- Next steps
Taught by
Kumaran Ponnambalam