Overview
Explore the process of transforming a data science idea into a production-ready model using Apache Airflow in this 22-minute conference talk from Databricks. Learn how data engineers can build a flexible platform that satisfies the needs of various stakeholders, including data scientists, infrastructure engineers, and product owners. Discover how Apache Airflow serves as a collaborative tool between data scientists and infrastructure engineers, offering a Pythonic interface that abstracts away system complexity. Follow the journey of a single-machine notebook evolving into a cross-service Spark + TensorFlow pipeline, culminating in a canary-tested, hyperparameter-tuned model deployed on Google Cloud Functions. Gain insights into Airflow's ability to connect the different layers of a data team, enabling rapid results and efficient collaboration. Understand the benefits for both data engineers and analysts, including custom operator creation, job submission, and pipeline building. Delve into topics such as the data ecosystem, the bumper rail model, and the advantages of using established tools over building from scratch.
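As a rough illustration of the Pythonic interface the talk describes (this is not code from the talk), the sketch below shows a minimal Airflow DAG that submits a Spark job and then runs a small custom operator of the kind a data engineer might hand to analysts. The DAG id, job path, and NotifyOwnerOperator are hypothetical, and the sketch assumes Airflow 2.x with the apache-airflow-providers-apache-spark package installed.

```python
# Minimal sketch (hypothetical names throughout), assuming Airflow 2.x
# and the apache-airflow-providers-apache-spark provider package.
from datetime import datetime

from airflow import DAG
from airflow.models.baseoperator import BaseOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator


class NotifyOwnerOperator(BaseOperator):
    """Hypothetical custom operator a data engineer might expose to the team."""

    def __init__(self, message: str, **kwargs):
        super().__init__(**kwargs)
        self.message = message

    def execute(self, context):
        # In practice this might post to Slack or email the product owner.
        self.log.info("Notifying owner: %s", self.message)


with DAG(
    dag_id="feature_pipeline",      # hypothetical pipeline name
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,         # trigger manually while experimenting
    catchup=False,
) as dag:
    # Cluster connection details live in the "spark_default" Airflow
    # connection, so the DAG author never touches them directly.
    featurize = SparkSubmitOperator(
        task_id="featurize",
        application="/jobs/featurize.py",  # hypothetical Spark job
        conn_id="spark_default",
    )

    notify = NotifyOwnerOperator(
        task_id="notify_owner",
        message="Features refreshed",
    )

    featurize >> notify  # run the Spark job, then notify the owner
```

The point of the pattern is the one the talk makes: SparkSubmitOperator hides cluster configuration behind a named Airflow connection, so analysts compose pipelines in plain Python while infrastructure engineers own the plumbing underneath.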
Syllabus
Intro
Data Ecosystem
Data Scientists
Data Infrastructure
Data Analysts
Bumper Rail Model
Don't Build Your Own!!
What's in it for the Data Engineers?
Submitting a Spark Job
Can Abstract Many Spark System Configurations
Data Engineers Can Create Custom Operators
What's in it for the Analysts?
Building a Data Science Pipeline
Experiment
Jupyter Notebooks + Airflow
Parameterize
Getting involved with Apache Airflow
Taught by
Databricks