Advances in Quantization for Efficient On-Device Neural Network Inference

EDGE AI FOUNDATION via YouTube

Overview

Learn about cutting-edge quantization techniques for efficient on-device AI inference in this 18-minute conference talk from Qualcomm AI Research Staff Engineer Mart van Baalen at tinyML EMEA. Explore the comparison between FP8 and INT8 formats, understand the challenge of oscillating weights in quantization-aware training, and discover solutions for handling outliers in transformers and large language models. Gain practical insights into mixed-precision methods and learn how to optimize deep neural networks for reduced memory usage, compute requirements, and energy consumption. Delve into detailed technical discussions covering the distribution differences between numerical formats, accuracy comparisons, and practical implementation challenges in quantization-aware training, with specific examples using the MobileNetV2 architecture. Master the techniques needed to make AI more efficient and deployable on edge devices within strict power and thermal constraints.
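The talk itself is video-only, but as a rough illustration of the kind of integer quantization it covers, here is a minimal sketch of symmetric per-tensor INT8 quantization in NumPy. The function names and the per-tensor scaling choice are assumptions made for this example, not details taken from the talk:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor INT8 quantization: map floats onto a [-127, 127] grid."""
    scale = np.max(np.abs(x)) / 127.0        # one scale shared by the whole tensor
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover a float approximation of the original tensor."""
    return q.astype(np.float32) * scale

# Storing INT8 weights takes 4x less memory than FP32, at the cost of a
# small, bounded rounding error per element.
w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(w)
print("max abs error:", np.max(np.abs(w - dequantize(q, scale))))
```

Quantization-aware training, one of the talk's main topics, keeps a rounding step like this in the forward pass so the network learns to compensate for the quantization error.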

Syllabus

Intro
Low-precision numerical formats
INT8 and FP8 have the same number of values but different distributions (illustrated in the sketch after this syllabus).
INT8 and FP8 accuracy
Challenges in using integer quantization
Introduction to Quantization-Aware Training (QAT)
Oscillating weights in QAT
MobileNetV2 - comparison to literature
Why do outliers occur?
Outliers in Transformers
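The syllabus point about INT8 and FP8 having the same number of values but different distributions can be made concrete with a small enumeration. The sketch below assumes a simplified FP8 E4M3 layout (4 exponent bits with bias 7, 3 mantissa bits, subnormals at exponent zero, NaN encodings ignored); this is one common FP8 variant and may not match the exact conventions used in the talk:

```python
import numpy as np

def e4m3_values() -> np.ndarray:
    """Enumerate the values of a simplified FP8 E4M3 format (NaN patterns ignored)."""
    vals = set()
    for sign in (1.0, -1.0):
        for e in range(16):
            for m in range(8):
                if e == 0:
                    v = sign * (m / 8.0) * 2.0 ** -6              # subnormals
                else:
                    v = sign * (1.0 + m / 8.0) * 2.0 ** (e - 7)   # normals
                vals.add(v)
    return np.array(sorted(vals))

fp8 = e4m3_values()
int8 = np.arange(-128, 128, dtype=np.float64)

# Both formats use 8 bits, so they encode roughly the same number of values...
print(len(fp8), len(int8))
# ...but INT8 spacing is uniform, while FP8 packs values densely near zero
# and spaces them far apart near its maximum.
print(np.unique(np.diff(int8)))        # always 1
print(np.diff(fp8[fp8 >= 0])[:3])      # tiny steps near zero
print(np.diff(fp8)[-3:])               # large steps near the format's maximum
```

The talk's accuracy comparison between the two formats builds on exactly this difference in how the available values are distributed.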

Taught by

EDGE AI FOUNDATION
