Stanford Seminar - Mixture of Experts Paradigm and the Switch Transformer
Stanford University via YouTube
Overview
Explore the groundbreaking Mixture of Experts (MoE) paradigm and the Switch Transformer in this Stanford seminar. Delve into how MoE departs from traditional deep learning models, which reuse the same parameters for every input, by instead selecting different parameters for each input, resulting in sparsely-activated models with vast numbers of parameters but a constant computational cost per input. Learn about the simplification of MoE routing algorithms, improved model designs with reduced communication and computational costs, and training techniques that address instabilities. Discover how large sparse models can be trained in lower-precision formats, leading to significant increases in pre-training speed. Examine the application of these improvements in multilingual settings and the scaling of language models to trillion-parameter sizes. Gain insights from research scientists Barret Zoph and Irwan Bello as they discuss their work on deep learning topics including neural architecture search, data augmentation, semi-supervised learning, and model sparsity.
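To make the routing idea concrete, below is a minimal sketch of a Switch-style top-1 routing layer in plain NumPy. It is an illustration of the general technique, not the speakers' implementation: a small router assigns each token to a single expert feed-forward network, so total parameters grow with the number of experts while computation per token stays roughly constant. All dimensions, expert counts, and weight initializations here are illustrative assumptions.

```python
# Minimal sketch of Switch-style top-1 expert routing (illustrative, not the
# authors' code). Each token is dispatched to exactly one expert FFN.
import numpy as np

rng = np.random.default_rng(0)

d_model, d_ff, num_experts, num_tokens = 16, 64, 4, 8

# Router: a single linear layer producing one logit per expert.
w_router = rng.normal(scale=0.02, size=(d_model, num_experts))

# Each expert is an independent two-layer feed-forward network.
experts = [
    {
        "w_in": rng.normal(scale=0.02, size=(d_model, d_ff)),
        "w_out": rng.normal(scale=0.02, size=(d_ff, d_model)),
    }
    for _ in range(num_experts)
]

def switch_layer(x):
    """Route each token to its single highest-probability expert."""
    logits = x @ w_router                                # [tokens, experts]
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)           # softmax over experts
    expert_idx = probs.argmax(axis=-1)                   # top-1 routing decision
    gate = probs[np.arange(len(x)), expert_idx]          # router confidence

    out = np.zeros_like(x)
    for e, params in enumerate(experts):
        mask = expert_idx == e
        if not mask.any():
            continue
        h = np.maximum(x[mask] @ params["w_in"], 0.0)    # ReLU feed-forward
        # Scale by the gate probability so the routing choice remains
        # differentiable when this layer is trained end to end.
        out[mask] = gate[mask, None] * (h @ params["w_out"])
    return out

tokens = rng.normal(size=(num_tokens, d_model))
print(switch_layer(tokens).shape)                        # (8, 16)
```

In the full Switch Transformer, an auxiliary load-balancing loss and per-expert capacity limits keep tokens spread evenly across experts; those details are omitted here for brevity.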
Syllabus
CS25 I Stanford Seminar 2022 - Mixture of Experts (MoE) paradigm and the Switch Transformer
Taught by
Stanford Online