Vision Transformer and Its Applications

Classroom Contents
- 1 Intro
- 2 Vision Transformer (ViT) and Its Applications
- 3 Why It Matters
- 4 Human Visual Attention
- 5 Attention is a Dot Product between Two Features
- 6 In Natural Language Processing
- 7 Image to Patches
- 8 Linear Projection - Patches to Features (sketched in the first code example after this list)
- 9 Vision Transformer is Invariant to Position of Patches
- 10 Position Embedding
- 11 Learnable Class Embedding
- 12 Why Layer Norm?
- 13 Why Skip Connection?
- 14 Why Multi-Head Self-Attention?
- 15 A Transformer Encoder is Made of L Encoder Modules Stacked Together (see the second code example after this list)
- 16 Versions Based on Layers, MLP Size, MSA Heads
- 17 Pre-training on a Large Dataset, Fine-tuning on the Target Dataset
- 18 Training by Knowledge Distillation (DeiT)
- 19 Semantic Segmentation (mIoU: 50.3 SETR vs. baseline PSPNet on ADE20K)
- 20 Semantic Segmentation (mIoU: 84.4 SegFormer vs. 82.2 SETR on Cityscapes)
- 21 Vision Transformer for Scene Text Recognition (ViTSTR)
- 22 Parameter, FLOPS, and Speed Efficient
- 23 Medical Image Segmentation (DSC: 77.5 TransUNet vs. 71.3 R50-ViT baseline)
- 24 Limitations
- 25 Recommended Open-Source Implementations of ViT
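
For chapters 7, 8, 10, and 11 (image to patches, linear projection to features, position embedding, and the learnable class embedding), here is a minimal PyTorch sketch of the input pipeline those chapters describe. The hyperparameters (224x224 images, 16x16 patches, 768-dimensional embeddings) are assumptions taken from the common ViT-Base configuration, not from the video itself.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into patches, linearly project each patch to a feature
    vector, prepend a learnable class token, and add position embeddings.
    Dimensions follow the assumed ViT-Base configuration."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2  # 14 * 14 = 196
        # A stride-p convolution is equivalent to flattening each p x p patch
        # and applying one shared linear projection (chapter 8).
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

    def forward(self, x):                       # x: (B, 3, 224, 224)
        x = self.proj(x)                        # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)        # (B, 196, 768): one feature per patch
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)          # (B, 197, 768)
        # Without this addition the model is permutation-invariant over
        # patches (chapter 9); the position embedding restores order.
        return x + self.pos_embed

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 768])
```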
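Chapters 12 through 15 (layer norm, skip connections, multi-head self-attention, and stacking L encoder modules) combine into the standard encoder block. The sketch below assumes the pre-layer-norm arrangement and ViT-Base sizes (12 heads, MLP ratio 4, L = 12); it is illustrative, not the speaker's exact implementation.

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One ViT encoder module: layer norm before each sub-block,
    multi-head self-attention, and a skip connection around both
    the attention and the MLP sub-blocks."""
    def __init__(self, embed_dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, mlp_ratio * embed_dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * embed_dim, embed_dim),
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # skip connection around MSA
        x = x + self.mlp(self.norm2(x))                    # skip connection around MLP
        return x

# A transformer encoder is L such modules stacked together (chapter 15);
# L = 12 here, matching the assumed ViT-Base configuration.
encoder = nn.Sequential(*[EncoderBlock() for _ in range(12)])
```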