Explore an approach to neural network inference optimization in this 22-minute conference talk from OSDI '24. The talk presents MonoNN, a machine learning optimizing compiler that takes a monolithic design approach to static neural network inference tasks on modern GPU architectures. Learn how MonoNN fits an entire neural network into a single GPU kernel, which significantly reduces non-computation overhead and opens up new optimization opportunities. See the main obstacle to this design, resource incompatibility between neural network operators, and how MonoNN tackles it by exploring parallelism compensation strategies. Gain insight into the schedule-independent group tuning technique that keeps the very large optimization space manageable. Review the reported results: an average speedup of 2.01× over state-of-the-art frameworks and compilers, and up to 7.3× over systems including TVM, TensorRT, XLA, and AStitch. The implementation is open source for anyone who wants to explore it further.
Overview
Syllabus
OSDI '24 - MonoNN: Enabling a New Monolithic Optimization Space for Neural Network Inference...
Taught by
USENIX