What you'll learn:
- Learn the Spark Architecture
- Learn Spark Execution Concepts
- Learn Spark Transformations and Actions using the Structured API
- Learn Spark Transformations and Actions using the RDD (Resilient Distributed Datasets) API
- Learn how to set up your own local PySpark Environment
- Learn how to interpret the Spark Web UI
- Learn how to interpret the DAG (Directed Acyclic Graph) for Spark Execution
- Learn the RDD (Resilient Distributed Datasets) API (Crash Course)
- Learn the Spark DataFrame API (Structured APIs)
- Learn Spark SQL
- Learn Spark on Databricks
- Learn to Visualize Data (Graphs and Dashboards) on Databricks
The key objectives of this course are as follows:
- Learn the Spark Architecture
- Learn Spark Execution Concepts
- Learn Spark Transformations and Actions using the Structured API
- Learn Spark Transformations and Actions using the RDD (Resilient Distributed Datasets) API
- Learn how to set up your own local PySpark Environment
- Learn how to interpret the Spark Web UI
- Learn how to interpret the DAG (Directed Acyclic Graph) for Spark Execution
- Learn the RDD (Resilient Distributed Datasets) API (Crash Course), as sketched below
  - RDD Transformations
  - RDD Actions
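Transformations and actions are the heart of the RDD API. Here is a minimal sketch of the difference, assuming a local PySpark install (e.g. via `pip install pyspark`); the numbers and variable names are illustrative, not course data:

```python
# A minimal sketch of RDD transformations vs. actions; the numbers and
# variable names below are illustrative, not course data.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-crash-course").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize([1, 2, 3, 4, 5])

# Transformations are lazy: they describe a new RDD without computing it.
doubled = numbers.map(lambda x: x * 2)  # [2, 4, 6, 8, 10]
multiples_of_4 = doubled.filter(lambda x: x % 4 == 0)

# Actions trigger execution and return results to the driver.
print(multiples_of_4.collect())  # [4, 8]
print(doubled.count())           # 5
```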
- Learn the Spark DataFrame API (Structured APIs), as sketched below
  - Create Schemas and Assign Data Types
  - Read and Write Data using the DataFrame Reader and Writer
  - Read Semi-Structured Data such as JSON
  - Create New Data Columns in the DataFrame using Expressions
  - Filter the DataFrame using the "filter" and "where" Transformations
  - Ensure that the DataFrame has Unique Rows
  - Detect and Drop Duplicates
  - Augment the DataFrame by Adding New Rows
  - Combine Two or More DataFrames
  - Order the DataFrame by Specific Columns
  - Rename and Drop Columns from the DataFrame
  - Clean the DataFrame by Detecting and Removing Missing or Bad Data
  - Create User-Defined Spark Functions
  - Read from and Write to Parquet Files
  - Partition the DataFrame and Write to Parquet Files
  - Aggregate the DataFrame using Spark SQL functions (count, countDistinct, max, min, sum, sumDistinct, avg)
  - Perform Aggregations with Grouping
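The following minimal sketch ties several of these operations together. The file paths, schema, and column names are assumptions for illustration, not the course dataset:

```python
# A minimal sketch of common DataFrame operations; the file paths,
# schema, and column names are assumptions, not the course dataset.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("dataframe-api").getOrCreate()

# Create a schema and assign data types explicitly.
schema = StructType([
    StructField("region", StringType(), True),
    StructField("country", StringType(), True),
    StructField("amount", DoubleType(), True),
])

# Read with the DataFrame reader (hypothetical path).
df = spark.read.csv("data/sales.csv", header=True, schema=schema)

# Clean: remove rows with missing values, then drop duplicate rows.
df = df.dropna().dropDuplicates()

# Add a new column with an expression, then filter.
df = df.withColumn("amount_eur", F.expr("amount * 0.9"))
df = df.filter(F.col("amount") > 0)

# Aggregate with grouping, using Spark SQL functions.
summary = df.groupBy("region").agg(
    F.count("*").alias("orders"),
    F.countDistinct("country").alias("countries"),
    F.sum("amount").alias("total"),
    F.avg("amount").alias("average"),
)

# Write the result to Parquet, partitioned by region (hypothetical path).
summary.write.mode("overwrite").partitionBy("region").parquet("out/sales_summary")
```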
- Learn Spark SQL and Databricks, as sketched below
  - Create a Databricks Account
  - Create a Databricks Cluster
  - Create Databricks SQL and Python Notebooks
  - Learn Databricks Shortcuts
  - Create Databases and Tables using Spark SQL
  - Use DML, DQL, and DDL with Spark SQL
  - Use Spark SQL Functions
  - Learn the Differences between Managed and Unmanaged Tables
  - Read CSV Files from the Databricks File System
  - Learn to Write Complex SQL
  - Create Visualizations with Databricks
  - Create a Databricks Dashboard
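A minimal Spark SQL sketch covering DDL, DML, and DQL; the database and table names are assumptions. It runs in a Databricks Python notebook (where `spark` already exists) or against a local SparkSession:

```python
# A minimal Spark SQL sketch (DDL, DML, DQL); the database and table
# names are assumptions for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql").getOrCreate()

# DDL: create a database and a managed table.
spark.sql("CREATE DATABASE IF NOT EXISTS sales_db")
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_db.orders (
        order_id INT,
        region   STRING,
        amount   DOUBLE
    ) USING PARQUET
""")

# DML: insert some rows.
spark.sql("INSERT INTO sales_db.orders VALUES (1, 'EMEA', 120.0), (2, 'APAC', 80.0)")

# DQL: query the table with Spark SQL functions.
spark.sql("""
    SELECT region, COUNT(*) AS orders, ROUND(SUM(amount), 2) AS total
    FROM sales_db.orders
    GROUP BY region
""").show()
```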
The Python Spark projects that we are going to build together:
- Sales Data (see the sketch below)
  - Create a Spark Session
  - Read a CSV File into a Spark DataFrame
  - Learn to Infer a Schema
  - Select Data from the Spark DataFrame
  - Produce Analytics that Show the Top Sales Orders per Region and Country
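A minimal sketch of these steps; the file path and column names are assumptions, and the top order per Region and Country is found here with a window function, which is one reasonable approach:

```python
# A minimal sketch of the Sales Data project; the file path and column
# names are assumptions, not the course files.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("sales-data").getOrCreate()

# Read a CSV file, letting Spark infer the schema from the data.
df = spark.read.csv("data/sales_records.csv", header=True, inferSchema=True)
df.printSchema()  # inspect the inferred schema

# Select the columns of interest (hypothetical names).
orders = df.select("Region", "Country", "Order ID", "Total Revenue")

# Rank orders by revenue within each Region/Country and keep the top one.
w = Window.partitionBy("Region", "Country").orderBy(F.desc("Total Revenue"))
top_orders = (orders
              .withColumn("rank", F.row_number().over(w))
              .filter(F.col("rank") == 1)
              .drop("rank"))
top_orders.show()
```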
- Convert Fahrenheit to Celsius (see the sketch below)
  - Create a Spark Session
  - Read and Parallelize Data using the Spark Context into an RDD
  - Create a Function to Convert Fahrenheit to Celsius
  - Use the map Transformation to Convert Data Contained within an RDD
  - Filter Temperatures Greater than or Equal to 13 Degrees Celsius
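A minimal sketch of the conversion pipeline; the sample readings are illustrative:

```python
# A minimal sketch of the Fahrenheit-to-Celsius project; the sample
# readings below are illustrative, not course data.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("f-to-c").getOrCreate()
sc = spark.sparkContext

# Parallelize Fahrenheit readings into an RDD.
fahrenheit = sc.parallelize([32.0, 50.0, 59.0, 68.0, 77.0])

def to_celsius(f):
    """Convert a Fahrenheit reading to Celsius."""
    return (f - 32) * 5.0 / 9.0

# map applies the conversion to every element; filter keeps readings
# greater than or equal to 13 degrees Celsius.
celsius = fahrenheit.map(to_celsius)
warm = celsius.filter(lambda c: c >= 13)

print(warm.collect())  # [15.0, 20.0, 25.0]
```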
- XYZ Research (see the sketch below)
  - Create a Set of RDDs that Hold Research Data
  - Use the union Transformation to Combine RDDs
  - Use the subtract Transformation to Remove Values from an RDD
  - Use the RDD API to Answer the Following Questions:
    - How many research projects were initiated in the first three years?
    - How many projects were completed in the first year?
    - How many projects were completed in the first two years?
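A minimal sketch of the union and subtract transformations; the project IDs and year groupings below are invented for illustration:

```python
# A minimal sketch of union and subtract on RDDs; the project IDs and
# year groupings are invented for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("xyz-research").getOrCreate()
sc = spark.sparkContext

# Projects initiated in each of the first three years (hypothetical data).
year1 = sc.parallelize(["P1", "P2", "P3"])
year2 = sc.parallelize(["P4", "P5"])
year3 = sc.parallelize(["P6"])

# Projects still in progress at the end of years 1 and 2 (hypothetical data).
open_after_y1 = sc.parallelize(["P2", "P3"])
open_after_y2 = sc.parallelize(["P5"])

# union combines RDDs; distinct guards against overlap between years.
initiated = year1.union(year2).union(year3).distinct()
print(initiated.count())  # initiated in the first three years: 6

# subtract removes the still-open projects, leaving the completed ones.
print(year1.subtract(open_after_y1).count())               # year 1: 1
print(year1.union(year2).subtract(open_after_y2).count())  # years 1 and 2: 4
```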
- Sales Analytics (see the sketch below)
  - Create the Sales Analytics DataFrame from a Set of CSV Files
  - Prepare the DataFrame by Applying a Structure
  - Remove Bad Records from the DataFrame (Cleaning)
  - Generate New Columns in the DataFrame
  - Write a Partitioned DataFrame to a Parquet Directory
  - Answer the Following Questions and Create Visualizations using Seaborn and Matplotlib:
    - What was the best month in sales?
    - Which city sold the most products?
    - What time should the business display advertisements to maximize the likelihood of customers buying products?
    - What products are most often sold together in the state "NY"?
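A minimal sketch of the pipeline and the first question (best month in sales); the paths, column names, and date format are assumptions modeled on a typical retail sales dataset, and the plot uses Matplotlib as listed in the Technology Spec:

```python
# A minimal sketch of the Sales Analytics pipeline; paths, column names,
# and the date format are assumptions, not necessarily the course files.
import matplotlib.pyplot as plt
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sales-analytics").getOrCreate()

# Read a whole directory of CSV files into one DataFrame (hypothetical path).
df = spark.read.csv("data/sales/*.csv", header=True, inferSchema=True)

# Clean: drop missing values and stray repeated header rows.
df = df.dropna().filter(F.col("Order ID") != "Order ID")

# Generate new columns from the order date and price fields.
df = (df
      .withColumn("OrderDate", F.to_timestamp("Order Date", "MM/dd/yy HH:mm"))
      .withColumn("Month", F.month("OrderDate"))
      .withColumn("Sales",
                  F.col("Quantity Ordered").cast("int")
                  * F.col("Price Each").cast("double")))

# Write the cleaned DataFrame partitioned by month to a Parquet directory.
df.write.mode("overwrite").partitionBy("Month").parquet("out/sales")

# Best month in sales: aggregate in Spark, plot via pandas and Matplotlib.
monthly = df.groupBy("Month").agg(F.sum("Sales").alias("Total")).orderBy("Month")
pdf = monthly.toPandas()

plt.bar(pdf["Month"], pdf["Total"])
plt.xlabel("Month")
plt.ylabel("Total sales")
plt.title("Sales by month")
plt.show()
```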
Technology Spec:
- Python
- Jupyter Notebook
- JupyterLab
- PySpark (Spark with Python)
- Pandas
- Matplotlib
- Seaborn
- Databricks
- SQL