Overview
Explore a novel blockwise parallel decoding scheme for deep autoregressive sequence-to-sequence models in this video. Learn how the approach yields substantial improvements in generation speed when applied to architectures that can process output sequences in parallel. See how the method is verified empirically through experiments with state-of-the-art self-attention models for machine translation and image super-resolution. Understand how the proposed technique reduces the number of decoding iterations by up to 2x relative to a baseline greedy decoder with no loss in quality, or by up to 7x with a slight decrease in performance. Examine wall-clock speedups of up to 4x over standard greedy decoding. Gain insights into the trade-offs between per-layer computation and critical path length at training time for different architecture classes such as recurrent, convolutional, and self-attention networks.
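The core idea covered in the video is a predict-verify-accept loop: a set of auxiliary heads proposes a block of future tokens in parallel, the base model verifies them in a single parallel pass, and the longest verified prefix is accepted, which guarantees the same output as standard greedy decoding. The sketch below illustrates that loop under assumed, hypothetical model methods (`propose_block`, `greedy_verify`); it is not the paper's actual implementation.

```python
def blockwise_parallel_decode(model, prefix, block_size, max_len, eos_token):
    """Minimal sketch of greedy blockwise parallel decoding.

    `model.propose_block(output, k)` is assumed to return k proposed next
    tokens, and `model.greedy_verify(output, proposal)` is assumed to return
    the base model's greedy token at each proposed position, computed in one
    parallel pass. Both methods are hypothetical placeholders.
    """
    output = list(prefix)
    while len(output) < max_len:
        # Predict: propose the next `block_size` tokens in parallel.
        proposal = model.propose_block(output, block_size)

        # Verify: get the base model's greedy token at each proposed position.
        verified = model.greedy_verify(output, proposal)

        # Accept: keep the longest matching prefix; at the first mismatch,
        # fall back to the base model's token so the final output matches
        # standard greedy decoding exactly.
        accepted = []
        for proposed, greedy in zip(proposal, verified):
            if proposed != greedy:
                accepted.append(greedy)
                break
            accepted.append(proposed)

        output.extend(accepted)
        if eos_token in accepted:
            break
    return output
```

Because at least one token is accepted per iteration and often several, the number of decoding iterations drops while the output distribution under greedy acceptance stays unchanged, which is the source of the reported 2x iteration reduction without quality loss.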
Syllabus
Blockwise Parallel Decoding for Deep Autoregressive Models
Taught by
Yannic Kilcher