Overview

Learn big data processing with Apache Spark and Scala. The course begins with setting up your development environment, then offers a Scala crash course covering syntax, flow control, functions, and data structures, the essential skills you need to work with Spark.

Next, you will explore Spark's core abstraction, the Resilient Distributed Dataset (RDD). Through hands-on activities and exercises, you will learn to manipulate RDDs, implement key/value operations, and perform complex data transformations. The course then moves on to SparkSQL, DataFrames, and Datasets, where you will practice querying structured data. You will also tackle advanced Spark programming: applying algorithms to real-world datasets, working with clusters, and optimizing performance.

As you progress, you will use Spark MLlib for machine learning, building recommendation systems, performing regression analysis, and implementing decision trees. Finally, the course introduces Spark Streaming and GraphX for processing real-time data streams and graph-based data.

By the end of this course, you will be able to apply Spark and Scala to complex data processing tasks. The course is designed for software engineers who want to expand their skills into big data processing on a cluster. Some prior programming or scripting experience is required.

Syllabus

Getting Started

In this module, we will install the necessary tools, such as IntelliJ and Scala, on your local system, then give a brief introduction to the Apache Spark framework and its key concepts.

Scala Crash Course (Optional)

In this module, we will dive into Scala programming, covering its unique syntax, control flow, and key data structures like Map and List. You’ll get plenty of practice to ensure you’re comfortable with the language before jumping into Spark.

Using Resilient Distributed Datasets (RDDs)

In this module, we will break down the structure and usage of RDDs, the building blocks of Spark applications. You’ll work on real-world examples, including building histograms and analyzing social network data.

SparkSQL, DataFrames, and Datasets

In this module, we will introduce SparkSQL, DataFrames, and Datasets, which provide a higher-level abstraction for working with structured data. You’ll also compare and contrast the use of RDDs with these APIs.

Advanced Examples of Spark Programs

In this module, we will tackle advanced Spark programming examples, such as finding popular movies and identifying popular superheroes in a social graph. We will also cover optimization techniques using broadcast variables and accumulators.

Running Spark on a Cluster

In this module, we will move from running Spark on a local desktop to deploying and scaling it on a real cluster using Amazon Elastic MapReduce (EMR). You’ll learn how to submit jobs, manage dependencies, and troubleshoot issues.

Machine Learning with Spark ML

In this module, we will explore Spark MLlib for machine learning. You’ll work through examples like generating movie recommendations and performing linear regression on large datasets.

Introduction to Spark Streaming

In this module, we will focus on real-time data processing using Spark Streaming. You’ll build streaming applications and learn to handle data in motion with both DStreams and the newer Structured Streaming API.

Introduction to GraphX

In this module, we will cover Spark’s GraphX library for graph-parallel processing, using it to explore social network data and to measure how closely entities are connected with algorithms such as breadth-first search.

You Made It! Where to Go from Here

In this final module, we will provide additional learning resources and give you tips on leveraging your new skills to advance your career in the field of big data.