
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

Yannic Kilcher via YouTube

Overview

Explore a comprehensive video analysis of Switch Transformers, Google Brain's approach to scaling language models to trillion-parameter sizes while keeping compute manageable. Delve into how sparsity and hard (top-1) routing over mixture-of-experts layers let the parameter count grow dramatically while the computation per token stays roughly constant. Learn about the Switch Transformer architecture, its performance gains, and how it compares to dense models in speed and sample efficiency. Discover how model, data, and expert parallelism are combined, and examine experimental results that showcase the model's capabilities. Gain insights into techniques for stabilizing training and into distilling the sparse model back into dense models. Understand the potential impact of this approach on the future of AI scaling and its behavior in multilingual settings.
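To make the routing idea concrete, here is a minimal sketch of a top-1 ("switch") feed-forward layer: a router assigns each token to exactly one expert, so parameters scale with the number of experts while per-token compute does not. This is an illustrative assumption-laden toy in PyTorch, not the reference Google Brain implementation; all module and parameter names are made up for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwitchFeedForward(nn.Module):
    """Toy switch layer: each token is routed to exactly one expert FFN."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)  # produces routing logits
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model), tokens flattened across batch and sequence.
        probs = F.softmax(self.router(x), dim=-1)      # router probabilities
        gate, expert_idx = probs.max(dim=-1)           # hard top-1 routing decision
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i                     # tokens assigned to expert i
            if mask.any():
                # Scale by the gate value so the router still receives gradients.
                out[mask] = gate[mask].unsqueeze(-1) * expert(x[mask])
        return out


# Usage: adding experts grows the parameter count, but every token still
# passes through only one expert, so compute per token stays roughly constant.
tokens = torch.randn(8, 64)                            # 8 tokens, d_model = 64
layer = SwitchFeedForward(d_model=64, d_ff=256, num_experts=4)
print(layer(tokens).shape)                             # torch.Size([8, 64])
```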

Syllabus

- Intro & Overview
- Performance Gains from Scale
- Switch Transformer Architecture
- Model-, Data- and Expert-Parallelism
- Experimental Results
- Stabilizing Training
- Distillation into Dense Models
- Final Comments

Taught by

Yannic Kilcher

