In this lab, you learn how to visualize, prepare, and transform a dataset in Amazon SageMaker Data Wrangler. You also use Amazon S3 and SageMaker Studio to interact with Apache Hive using Apache Spark.
Objectives
- Understand effective methods for visualizing data
- Explore methods for data cleaning and transformation, including how to handle missing values, outliers, and duplicated data
- Learn how to ingest data into Amazon SageMaker Data Wrangler and transform it there
- Experiment with transforming data using Apache Spark on Amazon EMR
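To make the cleaning objective above concrete, here is a minimal sketch in plain Python of the three problems it names: duplicated rows, missing values, and outliers. It is not how Data Wrangler implements its transforms. The records, the function names, median imputation, and the 1.5 × IQR clipping rule are all assumptions chosen for illustration.

```python
from statistics import median, quantiles

# Hypothetical records of (id, age); None marks a missing value.
rows = [
    (1, 34.0),
    (2, None),    # missing value
    (3, 29.0),
    (3, 29.0),    # exact duplicate of the previous row
    (4, 41.0),
    (5, 38.0),
    (6, 250.0),   # implausible value, treated as an outlier
]


def drop_duplicates(records):
    """Keep the first occurrence of each record, preserving order."""
    seen = set()
    unique = []
    for record in records:
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return unique


def impute_median(records):
    """Replace missing values with the median of the observed values."""
    observed = [value for _, value in records if value is not None]
    fill = median(observed)
    return [(key, fill if value is None else value) for key, value in records]


def clip_outliers(records, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR] (assumed rule, k=1.5)."""
    values = [value for _, value in records]
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [(key, min(max(value, low), high)) for key, value in records]


# Order matters: deduplicate first so repeated rows do not skew the statistics.
cleaned = clip_outliers(impute_median(drop_duplicates(rows)))

for key, value in cleaned:
    print(key, value)
```

Deduplication runs before imputation and clipping so that repeated rows do not bias the median or the quartiles. In the lab itself, the equivalent steps are performed with Data Wrangler's transforms rather than hand-written code.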
Prerequisites
- Basic navigation of the AWS Management Console.
- An understanding of database concepts, MySQL, and database availability.
Outline
- Task 1: Import, visualize, and perform a preliminary analysis of the data with SageMaker Data Wrangler
- Task 2: Analyze and visualize the data
- Task 3: Perform data transformations and export the datasets
- Task 4: Set up the environment
- Task 5: Connect to an EMR cluster
- Task 6: Explore and query data from the SparkMagic PySpark kernel