Learn about DataFrames, a widely used data structure in Apache Spark. Discover how to manipulate and analyze distributed data with the DataFrame API and SQL.
Overview
Syllabus
Introduction
- Apache Spark SQL and data analysis
- What you should know
- Introduction to DataFrames
- SQL for DataFrames
- Install Spark
- Install PySpark
- Using Jupyter notebooks with PySpark
- Set up a Jupyter notebook
- Load data into DataFrames: CSV Files
- Load data into DataFrames: JSON Files
- Basic DataFrame operations
- Filter data with DataFrame API
- Aggregate data with DataFrame API
- Sample data from DataFrames
- Save data from DataFrames
- Querying DataFrames with SQL
- Filtering DataFrames with SQL
- Aggregating data with SQL
- Joining DataFrames with SQL
- Eliminating duplicates in DataFrames
- Working with NA values in DataFrames
- Exploratory data analysis with DataFrames
- Exploratory data analysis with Spark SQL
- Time series analysis with DataFrames
- Basic machine learning with DataFrames, part 1
- Basic machine learning with DataFrames, part 2
- Next steps
Taught by
Dan Sullivan