Apache Spark 3 for Data Engineering & Analytics with Python

via Udemy

Overview

Learn how to use Python and PySpark 3.0.1 for Data Engineering / Analytics (Databricks) - Beginner to Ninja

The key objectives of this course are as follows:

  • Learn the Spark Architecture

  • Learn Spark Execution Concepts

  • Learn Spark Transformations and Actions using the Structured API

  • Learn Spark Transformations and Actions using the RDD (Resilient Distributed Datasets) API

  • Learn how to set up your own local PySpark Environment

  • Learn how to interpret the Spark Web UI

  • Learn how to interpret DAG (Directed Acyclic Graph) for Spark Execution

  • Learn the RDD (Resilient Distributed Datasets) API (Crash Course)

    • RDD Transformations

    • RDD Actions

  • Learn the Spark DataFrame API (Structured APIs)

    • Create Schemas and Assign DataTypes

    • Read and Write Data using the DataFrame Reader and Writer

    • Read Semi-Structured Data such as JSON

    • Create and Add New Data Columns to the DataFrame using Expressions

    • Filter the DataFrame using the "Filter" and "Where" Transformations

    • Ensure that the DataFrame has unique rows

    • Detect and Drop Duplicates

    • Augment the DataFrame by Adding New Rows

    • Combine 2 or More DataFrames

    • Order the DataFrame by Specific Columns

    • Rename and Drop Columns from the DataFrame

    • Clean the DataFrame by Detecting and Removing Missing or Bad Data

    • Create User-Defined Spark Functions

    • Read from and Write to Parquet Files

    • Partition the DataFrame and Write to Parquet Files

    • Aggregate the DataFrame using Spark SQL functions (count, countDistinct, max, min, sum, sumDistinct, avg)

    • Perform Aggregations with Grouping

  • Learn Spark SQL and Databricks

    • Create a Databricks Account

    • Create a Databricks Cluster

    • Create Databricks SQL and Python Notebooks

    • Learn Databricks shortcuts

    • Create Databases and Tables using Spark SQL

    • Use DML, DQL, and DDL with Spark SQL

    • Use Spark SQL Functions

    • Learn the differences between Managed and Unmanaged Tables

    • Read CSV Files from the Databricks File System

    • Learn to write Complex SQL

    • Create Visualisations with Databricks

    • Create a Databricks Dashboard


The Python Spark projects that we are going to do together:

Sales Data

  • Create a Spark Session

  • Read a CSV file into a Spark Dataframe

  • Learn to Infer a Schema

  • Select data from the Spark Dataframe

  • Produce analytics that show the topmost sales orders per Region and Country


Convert Fahrenheit to Degrees Centigrade

  • Create a Spark Session

  • Read and Parallelize data using the Spark Context into an RDD

  • Create a Function to Convert Fahrenheit to Degrees Centigrade

  • Use the Map Function to convert data contained within an RDD

  • Filter temperatures greater than or equal to 13 degrees Celsius
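
The steps in this project could be sketched roughly as below. This is a minimal illustration and not the course's own solution: the function names `fahrenheit_to_celsius` and `convert_and_filter`, the app name, and the sample readings in the closing comment are all assumptions made for this example. PySpark is imported inside the function so that the conversion logic can be checked on its own without a Spark installation.

```python
def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32) * 5 / 9


def convert_and_filter(readings_f, threshold_c=13):
    """Convert Fahrenheit readings through an RDD and keep those >= threshold_c.

    Needs a local PySpark installation, so the import is done lazily here.
    """
    from pyspark.sql import SparkSession

    # Create (or reuse) a Spark Session; "temperatures" is a hypothetical app name.
    spark = (
        SparkSession.builder.master("local[*]")
        .appName("temperatures")
        .getOrCreate()
    )
    try:
        # Parallelize the Python list into an RDD using the Spark Context.
        rdd_f = spark.sparkContext.parallelize(readings_f)
        # map is a transformation: apply the conversion to every element.
        rdd_c = rdd_f.map(fahrenheit_to_celsius)
        # filter is a transformation; collect is the action that triggers the job.
        return rdd_c.filter(lambda c: c >= threshold_c).collect()
    finally:
        spark.stop()


# Example call with made-up readings (requires PySpark):
# convert_and_filter([59, 57.2, 53.5, 53.2, 55.4, 51.8])
```

Keeping the conversion as a plain Python function means it can be unit-tested first and then passed to `map` unchanged.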


XYZ Research

  • Create a set of RDDs that hold Research Data

  • Use the union transformation to combine RDDs

  • Learn to use the subtract transformation to remove values from an RDD

  • Use the RDD API to answer the following questions

    • How many research projects were initiated in the first three years?

    • How many projects were completed in the first year?

    • How many projects were completed in the first two years?
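
One way the union and subtract transformations might be applied to questions like these is sketched below. The course's actual research data is not shown on this page, so the project IDs, the one-RDD-per-year layout, and the helper names `combine`, `count_initiated`, `count_completed`, `still_open`, and `demo` are all hypothetical.

```python
def combine(*rdds):
    """Union any number of RDDs and drop repeated project IDs."""
    combined = rdds[0]
    for rdd in rdds[1:]:
        combined = combined.union(rdd)  # union keeps duplicates
    return combined.distinct()          # so remove them explicitly


def count_initiated(started_by_year):
    """Number of projects initiated across the given yearly RDDs."""
    return combine(*started_by_year).count()


def count_completed(completed_by_year):
    """Number of projects completed across the given yearly RDDs."""
    return combine(*completed_by_year).count()


def still_open(started, completed):
    """Projects in `started` that do not appear in `completed`."""
    return started.subtract(completed)


def demo():
    """Run the helpers on made-up data (requires a local PySpark installation)."""
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[*]")
        .appName("xyz-research")  # hypothetical app name
        .getOrCreate()
    )
    sc = spark.sparkContext
    # Hypothetical data: one RDD of project IDs per year.
    started = [
        sc.parallelize(["P1", "P2", "P3"]),
        sc.parallelize(["P4", "P5"]),
        sc.parallelize(["P6"]),
    ]
    completed = [sc.parallelize(["P1"]), sc.parallelize(["P2", "P4"])]

    print("initiated in years 1-3:", count_initiated(started))
    print("completed in year 1:", count_completed(completed[:1]))
    print("completed in years 1-2:", count_completed(completed))
    print("open after year 1:", still_open(started[0], completed[0]).collect())
    spark.stop()
```

Note that `union` does not remove duplicates, which is why the sketch calls `distinct` before counting.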


Sales Analytics

  • Create the Sales Analytics DataFrame from a set of CSV Files

  • Prepare the DataFrame by applying a Structure

  • Remove bad records from the DataFrame (Cleaning)

  • Generate New Columns from the DataFrame

  • Write a Partitioned DataFrame to a Parquet Directory

  • Answer the following questions and create visualizations using Seaborn and Matplotlib

    • What was the best month in sales?

    • What city sold the most products?

    • What time should the business display advertisements to maximize the likelihood of customers buying products?

    • What products are often sold together in the state "NY"?
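
The "generate new columns" and "write a partitioned DataFrame" steps might look something like the sketch below. It assumes, purely for illustration, that each record carries an address string of the form "street, city, ST zip" in a column called `PurchaseAddress`; that column name, the output path, and the helper names are not taken from the course.

```python
def extract_city_state(address):
    """Split 'street, city, ST zip' into (city, state); None if malformed."""
    if address is None:
        return None
    parts = [p.strip() for p in address.split(",")]
    if len(parts) != 3 or not parts[2]:
        return None
    state = parts[2].split()[0]  # first token of 'ST zip'
    return parts[1], state


def add_location_columns(df, address_col="PurchaseAddress"):
    """Add City and State columns using a user-defined function (UDF)."""
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType, StructField, StructType

    # The UDF returns a struct so both values come from one pass over the column.
    schema = StructType(
        [StructField("City", StringType()), StructField("State", StringType())]
    )
    location = udf(extract_city_state, schema)
    return (
        df.withColumn("_loc", location(col(address_col)))
        .withColumn("City", col("_loc.City"))
        .withColumn("State", col("_loc.State"))
        .drop("_loc")
    )


def write_partitioned(df, path):
    """Write the DataFrame as Parquet, one sub-directory per State value."""
    df.write.mode("overwrite").partitionBy("State").parquet(path)
```

Partitioning by `State` is only one possible choice; it would let a later query such as the "NY" question read a single sub-directory instead of the whole dataset.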

Technology Spec

  1. Python

  2. Jupyter Notebook

  3. Jupyter Lab

  4. PySpark (Spark with Python)

  5. Pandas

  6. Matplotlib

  7. Seaborn

  8. Databricks

  9. SQL

Taught by

David Charles Academy

Reviews

4.3 rating at Udemy based on 473 ratings
