AWS Data Processing and Analysis

Overview

This course takes you through the complete process of data handling, starting with AWS data processing services. You’ll begin with AWS Lambda, learning how to integrate serverless functions and manage scalable data pipelines. Through practical exercises, you’ll explore how AWS Glue helps automate data preparation and manage complex ETL jobs, including partitioning a data lake and modifying the Glue Data Catalog. Hands-on experience with Glue Studio and DataBrew will further strengthen your skills in preparing data for analysis.

The course also delves into processing large datasets using Amazon EMR, where you’ll work with Apache Spark, Hive, and other tools in the Hadoop ecosystem. You’ll learn to optimize data processing with EMR, partition and store data efficiently, and integrate EMR with AWS services like Kinesis and Redshift. Exercises in Apache Spark will show you how to analyze data streams and deliver actionable insights in real time.

Lastly, you’ll focus on analysis using services like Kinesis Analytics, OpenSearch, and Athena. The course will guide you through setting up advanced analytics using Kinesis, creating real-time monitoring applications, and visualizing data using OpenSearch and QuickSight.

By the end of this course, you’ll be well-equipped to build data pipelines and to process and analyze data at scale using AWS’s tools. This course is ideal for data engineers, IT professionals, and data analysts aiming to leverage AWS for data processing and analysis. Some familiarity with AWS services is recommended.

Syllabus

Domain 3: Processing

In this module, we will delve into AWS processing services, beginning with an introduction to AWS Lambda and Glue. You’ll learn how to integrate these tools for serverless and ETL workflows. We will also explore advanced topics such as Glue ETL job execution, Lambda’s cost optimization strategies, and EMR’s support for open-source frameworks like Apache Spark, Hive, and Hadoop, along with its integration with other AWS services. Hands-on exercises will cover using Spark with Kinesis and Redshift and processing data lakes with EMR.

Domain 4: Analysis

In this module, we will focus on analyzing and querying data using AWS’s analytics services. We begin with an introduction to Kinesis Analytics, OpenSearch, and Athena, followed by performance tuning and security best practices. Through hands-on exercises, you’ll build real-world applications to monitor data streams, optimize queries using Glue and Athena, and perform data warehousing with Redshift. Additionally, we’ll explore Redshift’s durability, distribution styles, and newer features like AQUA and serverless options for large-scale data analytics.