Class Central Classrooms beta
YouTube videos curated by Class Central.
Classroom Contents
Vision Transformer and Its Applications
- 1 Intro
- 2 Vision Transformer (ViT) and Its Applications
- 3 Why it matters?
- 4 Human Visual Attention
- 5 Attention is Dot Product between 2 Features
- 6 In Natural Language Processing
- 7 Image to Patches
- 8 Linear Projection - Patches to Features
- 9 Vision Transformer is Invariant to Position of Patches
- 10 Position Embedding
- 11 Learnable Class Embedding
- 12 Why Layer Norm?
- 13 Why Skip Connection?
- 14 Why Multi-Head Self-Attention?
- 15 A Transformer Encoder is Made of L Encoder Modules Stacked Together
- 16 Versions based on Layers, MLP size, MSA heads
- 17 Pre-training on a large dataset, fine-tuning on the target dataset
- 18 Training by Knowledge Distillation (DeiT)
- 19 Semantic Segmentation (mIoU: 50.3 SETR vs baseline PSPNet on ADE20K)
- 20 Semantic Segmentation (mIoU: 84.4 SegFormer vs 82.2 SETR on Cityscapes)
- 21 Vision Transformer for STR (ViTSTR)
- 22 Parameter, FLOPS, Speed Efficient
- 23 Medical Image Segmentation (DSC: 77.5 TransUNet vs 71.3 R50-ViT baseline)
- 24 Limitations
- 25 Recommended Open-Source Implementations of ViT
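Several of the chapters above (attention as a dot product between two features, image to patches, linear projection of patches to features) can be sketched in a few lines of NumPy. This is a minimal illustrative sketch, not the lecture's implementation: the function names, patch size, and embedding dimension are assumptions, and position/class embeddings and multi-head splitting are omitted for brevity.

```python
import numpy as np

def image_to_patches(img, patch):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    H, W, C = img.shape
    rows, cols = H // patch, W // patch
    grid = img[:rows * patch, :cols * patch]
    grid = grid.reshape(rows, patch, cols, patch, C).transpose(0, 2, 1, 3, 4)
    return grid.reshape(rows * cols, patch * patch * C)

def attention(q, k, v):
    """Scaled dot-product attention: feature similarity is a dot product q . k."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ v

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32, 3))               # toy image (sizes assumed)
patches = image_to_patches(img, patch=8)             # (16, 192) flattened patches
W_proj = rng.standard_normal((192, 64)) * 0.05       # linear projection to features
x = patches @ W_proj                                 # (16, 64) patch embeddings
out = attention(x, x, x)                             # self-attention over patches
print(patches.shape, x.shape, out.shape)
```

Each output row is a similarity-weighted mixture of all patch features, which is why, as item 9 notes, the operation alone is invariant to patch order until position embeddings (item 10) are added.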