In this lab, you will deploy a fully functional Hadoop cluster, ready to analyze log data in just a few minutes. You will start by launching an Amazon EMR cluster and then use a HiveQL script to process sample log data stored in an Amazon S3 bucket. HiveQL is a SQL-like scripting language for data warehousing and analysis. You can then use a similar setup to analyze your own log files.
Level
Fundamental
Duration
1 Hour 0 Minutes
Course Objectives
In this course, you will learn how to:
- Launch a fully functional Hadoop cluster using **Amazon EMR**
- Define the schema and create a table for sample log data stored in Amazon S3
- Analyze the data using a **HiveQL** script and write the results back to Amazon S3
- Download and view the results on your computer
- Connect to the Hive CLI and run a **HiveQL** query script to view the results
Intended Audience
This course is intended for:
- Data Engineers
Prerequisites
We recommend that attendees of this course have the following prerequisites:
- IT Experience: Prior experience with Hadoop is recommended, but not required, to complete this lab
- AWS Experience: Basic familiarity with Amazon S3 and Amazon EC2 key pairs is suggested, but not required, to complete this lab
Course Outline
- Task 1: Create an Amazon S3 bucket
- Task 2: Launch an Amazon EMR cluster
- Task 3: Process your sample data by running a Hive script
- Task 4: View the results
- Task 5: Connect to the EMR cluster CLI and run a query using HiveQL
- Task 6: Terminate your Amazon EMR cluster
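For readers who want to see what Task 3 might look like outside the console, here is a minimal Python sketch of an EMR step definition that runs a Hive script. It is an illustration, not the lab's own procedure: the bucket name, script path, and the `INPUT`/`OUTPUT` variable names are hypothetical placeholders, and the helper `build_hive_step` is made up for this example. The dictionary follows the step structure that the EMR API accepts when steps are submitted through `command-runner.jar`.

```python
def build_hive_step(script_uri, input_uri, output_uri):
    """Return an EMR step definition that runs a Hive script stored in S3.

    All URIs are supplied by the caller; nothing here contacts AWS.
    """
    return {
        "Name": "Run Hive script",
        # Keep the cluster running if the step fails so logs can be inspected.
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hive-script", "--run-hive-script", "--args",
                "-f", script_uri,
                # -d defines variables the script can read as ${INPUT} and ${OUTPUT}.
                "-d", f"INPUT={input_uri}",
                "-d", f"OUTPUT={output_uri}",
            ],
        },
    }


# Hypothetical locations; replace with the bucket created in Task 1.
step = build_hive_step(
    "s3://example-lab-bucket/scripts/analyze_logs.q",
    "s3://example-lab-bucket/input/",
    "s3://example-lab-bucket/output/",
)
```

Assuming boto3 is installed and credentials are configured, a dictionary like this could be submitted to a running cluster with the EMR client's `add_job_flow_steps(JobFlowId=..., Steps=[step])` call, where the cluster ID comes from Task 2.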