
Google

Building Batch Data Pipelines on Google Cloud

Google via Google Cloud Skills Boost

Overview

Data pipelines typically follow one of three paradigms: Extract and Load (EL); Extract, Load, and Transform (ELT); or Extract, Transform, and Load (ETL). This course explains which paradigm to use for batch data, and when. It also covers several Google Cloud technologies for data transformation, including BigQuery, Spark on Dataproc, pipeline graphs in Cloud Data Fusion, and serverless data processing with Dataflow. Learners get hands-on experience building data pipeline components on Google Cloud using Qwiklabs.
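As a rough illustration of how the three paradigms differ, the sketch below runs EL, ELT, and ETL over a tiny in-memory dataset. It is not taken from the course material: the `raw_rows` data, the list standing in for a warehouse, and every function name are hypothetical, and no Google Cloud service is called. The only point is the ordering of the steps, that is, whether the transform happens before or after the load, or not at all.

```python
# Hypothetical source data; the second row has a malformed amount.
raw_rows = [
    {"id": 1, "amount": "10.5"},
    {"id": 2, "amount": "bad"},
    {"id": 3, "amount": "4"},
]


def extract():
    """Read rows from the (hypothetical) source system."""
    return [dict(row) for row in raw_rows]


def is_valid(row):
    """A simple data quality check: the amount must parse as a number."""
    try:
        float(row["amount"])
        return True
    except ValueError:
        return False


def transform(rows):
    """Drop invalid rows and convert amounts to floats."""
    return [
        {"id": row["id"], "amount": float(row["amount"])}
        for row in rows
        if is_valid(row)
    ]


def run_el():
    """EL: load the data as-is; suitable only when the source is already clean."""
    warehouse = []
    warehouse.extend(extract())
    return warehouse


def run_elt():
    """ELT: load raw data first, then transform it inside the warehouse."""
    warehouse = []
    warehouse.extend(extract())      # raw rows land in the warehouse
    return transform(warehouse)      # e.g. a SQL query or view over the raw table


def run_etl():
    """ETL: transform in the pipeline, so only clean rows reach the warehouse."""
    warehouse = []
    warehouse.extend(transform(extract()))
    return warehouse


if __name__ == "__main__":
    print("EL :", run_el())
    print("ELT:", run_elt())
    print("ETL:", run_etl())
```

In this toy setup ELT and ETL produce the same cleaned rows. What differs is where the raw data lives and where the transformation work is done, which is the trade-off the course modules discuss.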

Syllabus

  • Introduction
    • Course Introduction
  • Introduction to Building Batch Data Pipelines
    • Module introduction
    • EL, ELT, ETL
    • Quality considerations
    • How to carry out operations in BigQuery
    • Shortcomings
    • ETL to solve data quality issues
    • Introduction to Building Batch Data Pipelines
  • Executing Spark on Dataproc
    • Module introduction
    • The Hadoop ecosystem
    • Running Hadoop on Dataproc
    • Cloud Storage instead of HDFS
    • Optimizing Dataproc
    • Optimizing Dataproc storage
    • Optimizing Dataproc templates and autoscaling
    • Optimizing Dataproc monitoring
    • Lab Intro: Running Apache Spark jobs on Dataproc
    • Running Apache Spark jobs on Cloud Dataproc
    • Summary
    • Executing Spark on Dataproc
  • Serverless Data Processing with Dataflow
    • Module introduction
    • Introduction to Dataflow
    • Why customers value Dataflow
    • Building Dataflow pipelines in code
    • Key considerations with designing pipelines
    • Transforming data with PTransforms
    • Lab Intro: Building a Simple Dataflow Pipeline
    • A Simple Dataflow Pipeline (Python) 2.5
    • Serverless Data Analysis with Dataflow: A Simple Dataflow Pipeline (Java)
    • Aggregate with GroupByKey and Combine
    • Lab Intro: MapReduce in Beam
    • MapReduce in Beam (Python) 2.5
    • Serverless Data Analysis with Beam: MapReduce in Beam (Java)
    • Side inputs and windows of data
    • Lab Intro: Serverless Data Analysis with Dataflow: Side Inputs
    • Serverless Data Analysis with Dataflow: Side Inputs (Python)
    • Serverless Data Analysis with Dataflow: Side Inputs (Java)
    • Creating and re-using pipeline templates
    • Summary
    • Serverless Data Processing with Dataflow
  • Manage Data Pipelines with Cloud Data Fusion and Cloud Composer
    • Module introduction
    • Introduction to Cloud Data Fusion
    • Components of Cloud Data Fusion
    • Cloud Data Fusion UI
    • Build a pipeline
    • Explore data using wrangler
    • Lab Intro: Building and executing a pipeline graph in Cloud Data Fusion
    • Building and Executing a Pipeline Graph with Data Fusion 2.5
    • Orchestrate work between Google Cloud services with Cloud Composer
    • Apache Airflow environment
    • DAGs and Operators
    • Workflow scheduling
    • Monitoring and Logging
    • Lab Intro: An Introduction to Cloud Composer
    • An Introduction to Cloud Composer 2.5
    • Manage Data Pipelines with Cloud Data Fusion and Cloud Composer
  • Course Summary
    • Course Summary
  • Course Resources
    • Building Batch Data Pipelines on Google Cloud
  • Your Next Steps
    • Course Badge

