Vision Transformer and Its Applications

Open Data Science via YouTube


Classroom Contents

  1. Intro
  2. Vision Transformer (ViT) and Its Applications
  3. Why It Matters
  4. Human Visual Attention
  5. Attention Is the Dot Product Between 2 Features
  6. In Natural Language Processing
  7. Image to Patches
  8. Linear Projection: Patches to Features
  9. Vision Transformer Is Invariant to the Position of Patches
  10. Position Embedding
  11. Learnable Class Embedding
  12. Why Layer Norm?
  13. Why Skip Connection?
  14. Why Multi-Head Self-Attention?
  15. A Transformer Encoder Is Made of L Encoder Modules Stacked Together
  16. Versions Based on Layers, MLP Size, MSA Heads
  17. Pre-training on a Large Dataset, Fine-tuning on the Target Dataset
  18. Training by Knowledge Distillation (DeiT)
  19. Semantic Segmentation (mIoU: 50.3 SETR vs baseline PSPNet on ADE20K)
  20. Semantic Segmentation (mIoU: 84.4 SegFormer vs 82.2 SETR on Cityscapes)
  21. Vision Transformer for STR (ViTSTR)
  22. Parameter, FLOPs, and Speed Efficiency
  23. Medical Image Segmentation (DSC: 77.5 TransUNet vs 71.3 R50-ViT baseline)
  24. Limitations
  25. Recommended Open-Source Implementations of ViT
