Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

Yannic Kilcher via YouTube

Class Central Classrooms

YouTube videos curated by Class Central.

Classroom Contents

  1. Intro & Overview
  2. Performance Gains from Scale
  3. Switch Transformer Architecture
  4. Model-, Data- and Expert-Parallelism
  5. Experimental Results
  6. Stabilizing Training
  7. Distillation into Dense Models
  8. Final Comments
