Overview
Syllabus
Intro
• Performance here will not usually mean evaluation metrics
• Optimization does not mean optimization algorithms such as Adam, Adagrad, NAdam...
• Bias and generalization will also not be covered
• Performance in this talk focuses on latency, examples per second...
• Software engineers dealing with machine learning models
• Data scientists needing to know how to train more performant models
• Developers generally curious about the harder problems of deploying larger models
• Labeling and data quality
• Deploying models: setting up a REST API
• Packaging: how to deploy your ML pipeline
• Experiment tracking: metrics, sharing results
• Computer vision on:
  o Mobile devices
  o Single-board computers (Pis, Jetson Nano...)
  o Big servers with GPUs
• NLP on:
  o Big servers with GPUs
  o Large CPU models
• Data needs to be transformed before it can be used
• Fast transforms are usually an afterthought
ETL/Data Pipelines Primer
• Raw data needs to be converted to arrays (think pandas DataFrame to numpy array)
• Data can come from anywhere: databases, the web (REST), streams (Kafka, Spark, Flink...)
• Data can have different formats: gRPC, JSON, Arrow
• In summary: data movement and storage is its own set of problems
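A minimal stdlib-only sketch of the "raw data to arrays" step above — here, JSON records from a hypothetical REST response are flattened into fixed-order numeric rows, the shape you would then hand to numpy or pandas (the field names are invented for illustration):

```python
import json

# Raw records as they might arrive from a REST endpoint or message queue
# (hypothetical payload and field names).
raw = '[{"age": 34, "income": 72000.0}, {"age": 29, "income": 58500.0}]'

def records_to_rows(payload, columns):
    """Convert a JSON array of objects into fixed-order numeric rows,
    the layout a framework expects before building an ndarray."""
    records = json.loads(payload)
    return [[float(rec[col]) for col in columns] for rec in records]

rows = records_to_rows(raw, columns=["age", "income"])
print(rows)  # [[34.0, 72000.0], [29.0, 58500.0]]
```

In a real pipeline this transform step is where much of the latency hides, which is why "fast transforms as an afterthought" becomes a problem.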
Models Primer
• Models are stored in various formats: HDF5 (Keras), protobuf (TensorFlow, ONNX), pickle (PyTorch)
• Model files are a mix of configuration and parameters (ndarrays that represent the weights of the model)
• Model files can be anywhere from kilobytes to gigabytes
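To make the "configuration plus parameters" point concrete, here is a toy sketch (not any framework's real file layout) that pickles a dict mixing config with flat weight lists standing in for ndarrays, then measures the resulting file size:

```python
import os
import pickle
import tempfile

# A toy "model file": configuration plus parameters. The layer sizes and
# weight values are made up; real frameworks store ndarrays, not lists.
model = {
    "config": {"layers": [64, 64, 10], "activation": "relu"},
    "weights": {"dense_0": [0.01] * (64 * 64), "dense_1": [0.02] * (64 * 10)},
}

path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

print(f"model file size: {os.path.getsize(path) / 1024:.1f} KB")
```

Even this tiny toy model is tens of kilobytes; scaling the same structure to millions of parameters is how model files reach gigabytes.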
ML Pipelines are not just models
• ETL varies and can be represented in JSON, code, or even within the model via something like tf.data
• Metrics and experiments (evaluation results) may also be stored with a given model
Better in-memory file formats for data interchange
• Removing redundancy matters: identity ops, redundant layers...
• Model size matters: fewer parameters and less compute mean faster inference and less storage
• Format matters: some execution engines (TF Lite vs TensorFlow, TorchScript vs normal PyTorch) allow execution on different platforms
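A small sketch of the redundancy-removal idea: represent a graph as node-to-input edges (node names are hypothetical), then drop identity ops by rewiring each consumer to the identity's own input:

```python
# Graph as {node: input_node}; identity nodes forward their input unchanged,
# so they can be removed before deployment without changing results.
graph = {
    "conv": None,
    "id1": "conv",   # redundant identity op
    "bn": "id1",
    "relu": "bn",
    "id2": "relu",   # redundant identity op
    "out": "id2",
}

def remove_identities(g, identities):
    """Drop identity nodes, rewiring consumers past them."""
    def resolve(node):
        # Follow chains of identities back to a real producer.
        while node in identities:
            node = g[node]
        return node
    return {n: (resolve(i) if i else None)
            for n, i in g.items() if n not in identities}

print(remove_identities(graph, {"id1", "id2"}))
# {'conv': None, 'bn': 'conv', 'relu': 'bn', 'out': 'relu'}
```

Real graph optimizers (e.g. in deep learning compilers) do this kind of rewrite pass, among many others, on much richer graph representations.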
• Quantization: change the model data type from float to int (reduces memory and computation)
• Knowledge distillation: train a smaller model based on the outputs of a bigger model (student/teacher)
• Pruning: removing redundant weights or layers
• Batch norm folding: remove batch norm by taking its statistics and baking them into the previous convolution layer
• Fully connected layer decomposition
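The quantization bullet can be sketched with plain Python: affine int8-style quantization stores one byte per weight plus a scale and zero-point instead of four bytes per float. This is the idea only, not any framework's API:

```python
# Affine quantization sketch: map floats into [0, 255] with a scale and
# zero-point, so each weight needs 1 byte instead of 4.
def quantize(weights):
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255.0 or 1.0      # avoid zero scale for constant weights
    zero_point = round(-lo / scale)
    q = [round(w / scale) + zero_point for w in weights]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return [(v - zero_point) * scale for v in q]

w = [-0.42, 0.0, 0.13, 0.37, 1.58]        # made-up weight values
q, s, z = quantize(w)
restored = dequantize(q, s, z)
max_err = max(abs(a - b) for a, b in zip(w, restored))
print(q, f"max error {max_err:.4f}")
```

The round-trip error stays below one quantization step (the scale), which is why quantization usually costs little accuracy while cutting memory roughly 4x.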
Deep Learning Compilers: TVM, Glow, MLIR
• Compile models to executable binaries
• Handle finding an optimal graph for a given hardware configuration
• Note: not ready for production use; still very early days
Taught by
Open Data Science