Overview
Discover how to build scalable and optimized data analytics pipelines by combining the power of Apache Hadoop and Apache Spark.

Syllabus
Introduction
- The combined power of Spark and Hadoop Distributed File System (HDFS)
Hadoop and Spark Overview
- Apache Hadoop overview
- Apache Spark overview
- Integrating Hadoop and Spark
- Setting up the environment
- Using exercise files
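
The setup lessons above pair a local Spark installation with HDFS. As a minimal sketch of that wiring (the namenode URI, the app name, and the local[*] master are assumptions for a single-machine sandbox, not values from the course):

```python
from pyspark.sql import SparkSession

# Minimal local session; the HDFS namenode address is an assumption.
spark = (
    SparkSession.builder
    .appName("hadoop-spark-analytics")  # hypothetical app name
    .master("local[*]")                 # run Spark locally on all cores
    # Route unqualified paths to HDFS via Hadoop's fs.defaultFS setting.
    .config("spark.hadoop.fs.defaultFS", "hdfs://localhost:9000")
    .getOrCreate()
)
```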
Data Storage
- Storage formats
- Compression
- Partitioning
- Bucketing
- Best practices for data storage
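
A hedged sketch of how the partitioning and bucketing topics above look in PySpark; the DataFrame `df`, the column names, and the HDFS paths are illustrative placeholders:

```python
# Partitioned write: one subdirectory per distinct value of the column.
(df.write
   .mode("overwrite")
   .option("compression", "snappy")  # compress Parquet pages with Snappy
   .partitionBy("country")           # hypothetical partition column
   .parquet("hdfs://localhost:9000/warehouse/users"))

# Bucketed write: hash rows into a fixed number of files. Bucketing in
# Spark requires saving as a table rather than to a bare path.
(df.write
   .bucketBy(8, "user_id")           # hypothetical bucket column
   .sortBy("user_id")
   .mode("overwrite")
   .saveAsTable("users_bucketed", format="parquet"))
```

As a rule of thumb, partitioning suits low-cardinality columns that appear in filters, while bucketing suits high-cardinality columns that appear in joins.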
Data Ingestion
- Reading external files into Spark
- Writing to HDFS
- Parallel writes with partitioning
- Parallel writes with bucketing
- Best practices for ingestion
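
As a sketch of the ingestion flow above, assuming a local CSV file with a header row and a `subject` column (all paths and names here are placeholders):

```python
# Read an external CSV into a DataFrame.
raw = (
    spark.read
    .option("header", True)       # first line holds column names
    .option("inferSchema", True)  # let Spark guess column types
    .csv("file:///tmp/scores.csv")
)

# Write to HDFS in parallel; partitionBy splits the output by column
# value, so each subject lands in its own directory.
(raw.write
    .mode("overwrite")
    .partitionBy("subject")
    .parquet("hdfs://localhost:9000/data/scores"))
```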
Data Extraction
- How Spark works
- Reading HDFS files with schema
- Reading partitioned data
- Reading bucketed data
- Best practices for data extraction
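
For the extraction topics above, supplying an explicit schema skips Spark's inference pass, and filtering on a partition column lets it prune whole directories. A sketch under the same assumed layout as the ingestion example:

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Explicit schema: Spark reads no files just to infer column types.
schema = StructType([
    StructField("student", StringType()),
    StructField("score", IntegerType()),
    StructField("subject", StringType()),  # the partition column
])

df = (
    spark.read
    .schema(schema)
    .parquet("hdfs://localhost:9000/data/scores")
)

# Filtering on the partition column prunes whole directories at read time.
math_scores = df.where(df.subject == "math")
```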
Data Processing
- Pushing down projections
- Pushing down filters
- Managing partitions
- Managing shuffling
- Improving joins
- Storing intermediate results
- Best practices for data processing
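
The processing lessons above are about keeping work off the network: selecting only needed columns, filtering early, controlling shuffle width, broadcasting the small side of a join, and caching reused results. A hedged sketch (the `df` and `students` DataFrames and the shuffle-partition count are assumptions):

```python
from pyspark.sql import functions as F

# Shrink the shuffle width for a small dataset; Spark defaults to 200.
spark.conf.set("spark.sql.shuffle.partitions", "8")

# `students` is a hypothetical small lookup DataFrame keyed by student.
result = (
    df.select("student", "score")              # projection pushdown
      .where(F.col("score") > 50)              # filter pushdown
      .join(F.broadcast(students), "student")  # broadcast the small side
)

# Keep a reused intermediate result in memory across later stages.
result.cache()
result.count()  # an action, which materializes the cache
```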
Use Case
- Problem definition
- Data loading
- Total score analytics
- Average score analytics
- Top student analytics
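
As a sketch of what the three analytics in this use case might look like, assuming a `df` with student, subject, and score columns as in the earlier examples:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Total score per student.
totals = df.groupBy("student").agg(F.sum("score").alias("total_score"))

# Average score per subject.
averages = df.groupBy("subject").agg(F.avg("score").alias("avg_score"))

# Top student per subject via a window ranking.
by_score = Window.partitionBy("subject").orderBy(F.desc("score"))
top_students = (
    df.withColumn("rank", F.row_number().over(by_score))
      .where(F.col("rank") == 1)
      .drop("rank")
)
```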
Conclusion
- Next steps
Taught by
Kumaran Ponnambalam