Overview
Discover key insights from migrating machine learning and big data processing pipelines to Apache Airflow in this 38-minute conference talk. Explore how Skimlinks leverages Airflow to power their big data infrastructure, analyzing hundreds of terabytes of data. Learn about building ETL pipelines and managing machine learning Spark pipeline workflows using Airflow. Gain understanding of basic Airflow concepts and see real-life examples of defining workflows in Python code. Delve into advanced topics such as custom task operators, sensors, and plugins. Examine best practices, pros and cons of the tool, and implementation in Docker and Kubernetes environments. Understand the concept of Directed Acyclic Graphs (DAGs) and their importance in creating idempotent workflows.
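To make the DAG and idempotency ideas concrete, here is a minimal sketch in plain Python. It is not Airflow code and is not taken from the talk: the task names, the in-memory `store`, and the run date are all hypothetical. It only illustrates two points from the overview: tasks run in dependency order, and rerunning a workflow for the same date overwrites its output instead of duplicating it.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "train_model": {"transform"},
    "publish_report": {"transform"},
}

# In-memory stand-in for partitioned storage, keyed by (task, run_date).
store = {}


def run_task(task, run_date):
    """Overwrite the partition for this task and date, so reruns add no duplicates."""
    store[(task, run_date)] = f"{task} output for {run_date}"


def run_dag(run_date):
    """Run every task after its dependencies and return the execution order."""
    order = list(TopologicalSorter(dag).static_order())
    for task in order:
        run_task(task, run_date)
    return order


order = run_dag("2019-01-01")   # hypothetical run date
snapshot = dict(store)
run_dag("2019-01-01")           # rerun for the same date
print(order)
print(store == snapshot)        # the rerun left the stored state unchanged
```

Because each task writes to a slot determined only by its name and the run date, a retry or backfill for that date produces the same state as the first run. In Airflow itself the same effect is usually achieved by keying outputs on the run's logical date.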
Syllabus
Intro
Lessons learned from the migration to Apache Airflow
Agenda
Skimlinks: What we do
Why Airflow?
Data Architecture Overview
Airflow and Spark
DAG: Directed Acyclic Graph
Operator
Advanced Features
Sample code
Idempotent DAGs
Best practices: Docker and Kubernetes environments
Airflow: The Good, the Bad and the Ugly
Taught by
Linux Foundation