Overview
Syllabus
Intro
• Performance here will not usually mean evaluation metrics
• Optimization does not mean optimization algorithms such as Adam, Adagrad, NAdam...
• Bias and generalization will also not be covered
• Performance in this talk focuses on latency, examples per second...
• Software engineers dealing with machine learning models
• Data scientists needing to know how to train more performant models
• Developers generally curious about the harder problems of deploying larger models
• Labeling and data quality
• Deploying models: setting up a REST API
• Packaging: how to deploy your ML pipeline
• Experiment tracking: metrics, sharing results
• Computer vision on:
  o Mobile devices
  o Single-board computers (Pis, Jetson Nano...)
  o Big servers with GPUs
• NLP on:
  o Big servers with GPUs
  o Large CPU models
• Data needs to be transformed before it can be used
• Fast transforms are usually an afterthought
ETL/Data Pipelines Primer
• Raw data needs to be converted to arrays (think pandas DataFrame to numpy array)
• Data can come from anywhere: databases, the web (REST), streams (Kafka, Spark, Flink...)
• Data can have different formats: gRPC, JSON, Arrow
• In summary: data movement and storage is its own set of problems
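A minimal stdlib-only sketch of the "raw data to arrays" step above — here, JSON records from a hypothetical REST response are flattened into fixed-order numeric rows, the shape you would then hand to numpy or pandas (the field names are invented for illustration):

```python
import json

# Raw records as they might arrive from a REST endpoint or message queue
# (hypothetical payload and field names).
raw = '[{"age": 34, "income": 72000.0}, {"age": 29, "income": 58500.0}]'

def records_to_rows(payload, columns):
    """Convert a JSON array of objects into fixed-order numeric rows,
    the layout a framework expects before building an ndarray."""
    records = json.loads(payload)
    return [[float(rec[col]) for col in columns] for rec in records]

rows = records_to_rows(raw, columns=["age", "income"])
print(rows)  # [[34.0, 72000.0], [29.0, 58500.0]]
```

In a real pipeline this transform step is where much of the latency hides, which is why "fast transforms as an afterthought" becomes a problem.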
Models Primer
• Models are stored in various formats: HDF5 (Keras), protobuf (TensorFlow, ONNX), pickle (PyTorch)
• Model files are a mix of configuration and parameters (ndarrays that represent the weights of the model)
• Model files can be anywhere from kilobytes to gigabytes
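To make the "configuration plus parameters" point concrete, here is a toy sketch (not any framework's real file layout) that pickles a dict mixing config with flat weight lists standing in for ndarrays, then measures the resulting file size:

```python
import os
import pickle
import tempfile

# A toy "model file": configuration plus parameters. The layer sizes and
# weight values are made up; real frameworks store ndarrays, not lists.
model = {
    "config": {"layers": [64, 64, 10], "activation": "relu"},
    "weights": {"dense_0": [0.01] * (64 * 64), "dense_1": [0.02] * (64 * 10)},
}

path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

print(f"model file size: {os.path.getsize(path) / 1024:.1f} KB")
```

Even this tiny toy model is tens of kilobytes; scaling the same structure to millions of parameters is how model files reach gigabytes.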
ML Pipelines are not just models
• ETL varies and can be represented in JSON, code, or even within the model via something like tf.data
• Metrics and experiments (evaluation results) may also be stored with a given model
Better in-memory file formats for data interchange
• Removing redundancy matters: identity ops, redundant layers...
• Model size matters: fewer parameters and less compute mean faster inference and less storage
• Format matters: some execution engines (TF Lite vs TensorFlow, TorchScript vs normal PyTorch) allow execution on different platforms
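A small sketch of the redundancy-removal idea: represent a graph as node-to-input edges (node names are hypothetical), then drop identity ops by rewiring each consumer to the identity's own input:

```python
# Graph as {node: input_node}; identity nodes forward their input unchanged,
# so they can be removed before deployment without changing results.
graph = {
    "conv": None,
    "id1": "conv",   # redundant identity op
    "bn": "id1",
    "relu": "bn",
    "id2": "relu",   # redundant identity op
    "out": "id2",
}

def remove_identities(g, identities):
    """Drop identity nodes, rewiring consumers past them."""
    def resolve(node):
        # Follow chains of identities back to a real producer.
        while node in identities:
            node = g[node]
        return node
    return {n: (resolve(i) if i else None)
            for n, i in g.items() if n not in identities}

print(remove_identities(graph, {"id1", "id2"}))
# {'conv': None, 'bn': 'conv', 'relu': 'bn', 'out': 'relu'}
```

Real graph optimizers (e.g. in deep learning compilers) do this kind of rewrite pass, among many others, on much richer graph representations.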
• Quantization: change the model data type from float to int (reduces memory and computation)
• Knowledge distillation: train a smaller model based on the outputs of a bigger model (student/teacher)
• Pruning: removing redundant weights or layers
• Batch norm folding: remove batch norm by taking its statistics and baking them into the previous convolution layer
• Fully connected layer decomposition
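The quantization bullet can be sketched with plain Python: affine int8-style quantization stores one byte per weight plus a scale and zero-point instead of four bytes per float. This is the idea only, not any framework's API:

```python
# Affine quantization sketch: map floats into [0, 255] with a scale and
# zero-point, so each weight needs 1 byte instead of 4.
def quantize(weights):
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255.0 or 1.0      # avoid zero scale for constant weights
    zero_point = round(-lo / scale)
    q = [round(w / scale) + zero_point for w in weights]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return [(v - zero_point) * scale for v in q]

w = [-0.42, 0.0, 0.13, 0.37, 1.58]        # made-up weight values
q, s, z = quantize(w)
restored = dequantize(q, s, z)
max_err = max(abs(a - b) for a, b in zip(w, restored))
print(q, f"max error {max_err:.4f}")
```

The round-trip error stays below one quantization step (the scale), which is why quantization usually costs little accuracy while cutting memory roughly 4x.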
Deep Learning Compilers: TVM, Glow, MLIR
• Compile models to executable binaries
• Handle finding an optimal graph for a given hardware configuration
• Note: not ready for production use; still very early days
Taught by
Open Data Science