GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

Yannic Kilcher via YouTube

Classroom Contents

  1. Intro & Overview
  2. Main Results
  3. Mixture-of-Experts
  4. Difference to Scaling Classic Transformers
  5. Backpropagation in Mixture-of-Experts
  6. MoE Routing Algorithm in GShard
  7. GShard Einsum Examples
  8. Massively Multilingual Translation
  9. Results
  10. Conclusion & Comments
