Many companies and organizations have data they can’t use in its raw form. For data to become useful, it must go through the following stages:
- Data needs to be validated for completeness and accuracy.
- It needs to be stored and organized.
- It needs to be processed so it is accessible for analysis, visualization, or machine learning.
- It needs to be governed and secured.
A company starts to organize this data by creating a centralized repository that can store huge volumes and wide varieties of data. This is the data lake.
In this course, you will learn about the components, functions, and best practices for building data lakes on AWS.
The course has four modules: two self-paced digital courses and two hands-on labs.
Activities
This course includes interactive content, videos, knowledge checks, assessments, and hands-on labs.
Course objectives
In this course, you will learn to do the following:
- Explore best practices, technologies, and considerations for building a basic data lake solution on AWS.
- Recognize a typical process for building a scalable, secure, and cost-effective data lake solution on AWS.
- Identify relevant AWS services and configuration options to consider.
- Review relevant AWS resources, documentation, and tutorials.
- Practice building a basic data lake solution. (Lab 1)
Intended audience
This course is intended for the following job roles:
- Data engineers
- Data scientists
- Data analysts
- Business intelligence engineers
Prerequisites
We recommend that attendees of this course have the following:
- 2–3 years of experience in data engineering
- 1–2 years of hands-on experience with AWS services
- Completed AWS Cloud Practitioner Essentials or equivalent
- Completed Fundamentals of Analytics on AWS Part 1 and 2
- Completed Data Engineering on AWS – Foundations
Course outline
Module 1: Building a Data Lake Solution (75 min)
In this course, you explore the process, technologies, and guidelines for building a basic data lake solution on Amazon Web Services (AWS).
- Introduction
- Set Up Storage
- Ingest Data
- Build Data Catalog
- Transform Data
- Serve Data for Consumption
- Assessment
- Conclusion
In this lab, you explore the components of a data lake, organize your data into layers (or zones), and use Amazon S3 as the storage layer of your data lake.
- Task 1: Review S3 buckets for raw zone and consumption zone
- Task 2: Create S3 event notification and send events to Amazon EventBridge
- Task 3: Review the ingestion layer for your data lake solution
- Task 4: Review the processing layer for your data lake solution
- Task 5: Review the consumption layer for your data lake solution
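As a rough illustration of Task 2, the sketch below shows one way the S3-to-EventBridge step could look with boto3. It is not the lab's actual solution: the bucket name, the key prefix, and the helper function names are hypothetical. The helpers only build the request payloads, so they can be inspected without an AWS account; the final function makes the real call and assumes boto3 and valid credentials are available.

```python
def eventbridge_notification_config() -> dict:
    """Build a notification configuration that turns on EventBridge delivery.

    An empty EventBridgeConfiguration block asks S3 to send the bucket's
    events to the default event bus in the same account and Region.
    """
    return {"EventBridgeConfiguration": {}}


def object_created_pattern(bucket: str, prefix: str = "") -> dict:
    """Build an EventBridge event pattern matching new objects in a bucket.

    `bucket` and `prefix` are illustrative inputs, e.g. a raw-zone bucket
    and a per-source key prefix.
    """
    detail = {"bucket": {"name": [bucket]}}
    if prefix:
        # Prefix matching restricts the rule to one part of the raw zone.
        detail["object"] = {"key": [{"prefix": prefix}]}
    return {
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": detail,
    }


def enable_eventbridge(bucket: str, s3_client=None) -> None:
    """Enable EventBridge notifications on `bucket`.

    Caution: this call replaces the bucket's existing notification
    configuration, so merge in any existing settings before using it
    on a bucket that already has notifications.
    """
    if s3_client is None:
        import boto3  # imported lazily so the helpers above work without it

        s3_client = boto3.client("s3")
    s3_client.put_bucket_notification_configuration(
        Bucket=bucket,
        NotificationConfiguration=eventbridge_notification_config(),
    )
```

For example, `object_created_pattern("example-raw-zone", "sales/")` produces a pattern that could be supplied (serialized with `json.dumps`) as the event pattern of an EventBridge rule that triggers the ingestion layer whenever a file lands under `sales/` in the raw zone.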
In this course, you learn about ways to optimize the performance of your data lake on AWS. You also learn about security considerations and basic troubleshooting.
- Introduction
- Open Table Formats
- Security Using AWS Lake Formation
- Troubleshooting
- Assessment
- Conclusion
In this lab, you use a workflow provided as an AWS Lake Formation blueprint to simplify the creation of a data lake and ingestion of data. Lake Formation blueprints are workflows you can apply to an existing Lake Formation data lake. You can also apply them as a task in the setup and creation of a new data lake.
- Task 1: Explore the lab environment
- Task 2: Set up Lake Formation
- Task 3: Use a Lake Formation blueprint to create an AWS Glue workflow
- Task 4: Run and monitor the workflow
- Task 5: Validate the data lake setup results
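Task 4 (run and monitor the workflow) is done in the console during the lab, but the same idea can be sketched programmatically. The following is a minimal sketch, assuming the blueprint has already created an AWS Glue workflow; the function name, the workflow name in the usage note, and the polling defaults are hypothetical. The Glue client is passed in as a parameter so the polling logic can be exercised with a stub.

```python
import time

# Workflow run states after which polling can stop.
TERMINAL_STATES = {"COMPLETED", "STOPPED", "ERROR"}


def run_workflow(glue_client, workflow_name: str,
                 poll_seconds: float = 30.0, max_polls: int = 120) -> str:
    """Start a Glue workflow run and poll until it reaches a terminal state.

    `glue_client` is expected to behave like boto3.client("glue").
    Returns the final run status, or raises TimeoutError if the run is
    still in progress after `max_polls` checks.
    """
    run_id = glue_client.start_workflow_run(Name=workflow_name)["RunId"]

    for _ in range(max_polls):
        run = glue_client.get_workflow_run(Name=workflow_name, RunId=run_id)["Run"]
        status = run["Status"]
        if status in TERMINAL_STATES:
            return status
        time.sleep(poll_seconds)

    raise TimeoutError(
        f"Workflow {workflow_name!r} run {run_id} did not finish "
        f"after {max_polls} polls"
    )
```

With real credentials, a call might look like `run_workflow(boto3.client("glue"), "example-blueprint-workflow")`. A final status of `COMPLETED` only says the run finished, so checking the resulting tables and data (Task 5) remains a separate step.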