Overview
Generate genuine business insights from big data. Learn to implement Apache Hadoop and Spark workflows on AWS and other cloud platforms.
Syllabus
Introduction
- Scaling Apache Hadoop and Spark
- What you should know
- Using cloud services
- Modern Hadoop and Spark
- File systems used with Hadoop and Spark
- Apache or commercial Hadoop distros
- Hadoop and Spark libraries
- Hadoop on Google Cloud Platform
- Spark job on Google Cloud Platform
- Sign up for Databricks Community Edition
- Add Hadoop libraries
- Databricks AWS Community Edition
- Load data into tables
- Hadoop and Spark cluster on AWS EMR
- Run Spark job on AWS EMR
- Review batch architecture for ETL on AWS
- Apache Spark libraries
- Spark data interfaces
- Select your programming language
- Spark session objects (sketch below)
- Spark shell
- Tour the Databricks environment
- Tour the notebook
- Import and export notebooks
- Calculate Pi on Spark (sketch below)
- Run WordCount on Spark with Scala (sketch below)
- Import data
- Transformations and actions
- Caching and the DAG (sketch below)
- Architecture: Streaming for prediction
- Spark SQL (sketch below)
- SparkR
- Spark ML: Preparing data
- Spark ML: Building the model
- Spark ML: Evaluating the model (combined sketch below)
- Advanced machine learning on Spark
- MXNet
- Spark with ADAM for genomics
- Spark architecture for genomics
- Reexamine streaming pipelines
- Spark Streaming (sketch below)
- Streaming ingest services
- Advanced Spark Streaming with MLeap
- Scale Spark on the cloud by example
- Build a quick start with Databricks AWS
- Scale Spark cloud compute with VMs
- Optimize cloud Spark virtual machines
- Use AWS EKS containers and a data lake
- Optimize Spark cloud data tiers on Kubernetes
- Build reproducible cloud infrastructure
- Scale on GCP Dataproc or on Terra.bio
- Continue learning for scaling
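The sketches below illustrate a few of the syllabus topics in Scala, the language used in the WordCount lesson; they are minimal examples under stated assumptions, not the course's own exercise solutions. First, the Spark session object: a small sketch of creating a SparkSession, the entry point for DataFrame and SQL work. The application name and the local[*] master setting are placeholder assumptions for a local test; on AWS EMR or Databricks the cluster supplies the master.

```scala
import org.apache.spark.sql.SparkSession

// Build (or reuse) a SparkSession; appName and master are placeholders.
val spark = SparkSession.builder()
  .appName("scaling-hadoop-spark-demo")
  .master("local[*]")           // all local cores; omit on EMR/Databricks
  .getOrCreate()

println(spark.version)          // confirm the session is live
spark.stop()
```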
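For the "Calculate Pi on Spark" lesson, a minimal Monte Carlo sketch in the style of the stock Spark Pi example: random points are thrown at the unit square, and the fraction landing inside the quarter circle estimates Pi/4. The sample count is an arbitrary assumption.

```scala
import scala.util.Random
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("spark-pi").getOrCreate()
val numSamples = 1000000         // arbitrary sample size

val inside = spark.sparkContext
  .parallelize(1 to numSamples)
  .filter { _ =>
    val x = Random.nextDouble()
    val y = Random.nextDouble()
    x * x + y * y < 1.0          // did the point land inside the quarter circle?
  }
  .count()

println(s"Pi is roughly ${4.0 * inside / numSamples}")
spark.stop()
```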
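For the WordCount lesson, a minimal RDD-based sketch. The S3 input path is a hypothetical placeholder; on Databricks a DBFS path would be used instead.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("wordcount").getOrCreate()

val counts = spark.sparkContext
  .textFile("s3://my-bucket/input.txt")             // hypothetical input location
  .flatMap(line => line.toLowerCase.split("\\W+"))  // split lines into words
  .filter(_.nonEmpty)
  .map(word => (word, 1))
  .reduceByKey(_ + _)                               // sum the 1s per word

counts.take(20).foreach(println)                    // preview a few counts on the driver
spark.stop()
```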
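For the transformations, actions, and caching lessons, one short sketch: transformations such as filter are lazy and only extend the DAG, actions such as count trigger execution, and cache() keeps a reused dataset in memory so later actions skip recomputation. The numeric range is an arbitrary illustration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("caching-demo").getOrCreate()

val nums  = spark.sparkContext.parallelize(1 to 1000000)
val evens = nums.filter(_ % 2 == 0)  // transformation: only extends the DAG
evens.cache()                        // mark the result for in-memory reuse

println(evens.count())               // action: runs the DAG and fills the cache
println(evens.sum())                 // second action: served from cached partitions
spark.stop()
```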
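For the Spark SQL lesson, a sketch that loads a CSV file into a DataFrame, registers a temporary view, and queries it with SQL. The file path and the country/amount columns are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("spark-sql-demo").getOrCreate()

val sales = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/data/sales.csv")            // hypothetical dataset

sales.createOrReplaceTempView("sales")

spark.sql(
  """SELECT country, SUM(amount) AS total
    |FROM sales
    |GROUP BY country
    |ORDER BY total DESC""".stripMargin
).show()

spark.stop()
```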
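For the three Spark ML lessons (preparing data, building the model, evaluating the model), one combined sketch: assemble feature columns into a vector, fit a logistic regression inside a Pipeline, and score it with AUC on a held-out split. The dataset path and the f1/f2/label column names are assumptions for illustration only.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("spark-ml-demo").getOrCreate()
val raw = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/data/training.csv")         // hypothetical training data

// Prepare: combine raw columns into the single features vector Spark ML expects.
val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2"))
  .setOutputCol("features")

// Build: a simple logistic regression wired into a Pipeline.
val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")
val Array(train, test) = raw.randomSplit(Array(0.8, 0.2), seed = 42L)
val model = new Pipeline().setStages(Array(assembler, lr)).fit(train)

// Evaluate: area under the ROC curve on the held-out split.
val auc = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .evaluate(model.transform(test))
println(s"AUC = $auc")
spark.stop()
```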
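For the Spark Streaming lesson, a minimal Structured Streaming sketch that counts words arriving on a local socket and prints running totals to the console. The localhost:9999 source (fed, for example, by `nc -lk 9999`) is a placeholder for a quick demo; production pipelines would read from an ingest service such as Kinesis or Kafka.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("streaming-wordcount").getOrCreate()
import spark.implicits._

// Unbounded DataFrame of lines arriving on the socket.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Split lines into words and keep a running count per word.
val counts = lines.as[String]
  .flatMap(_.split("\\s+"))
  .groupBy("value")
  .count()

// Emit the full updated result table to the console after each micro-batch.
counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
  .awaitTermination()
```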
Taught by
Lynn Langit