Overview
Syllabus
Intro
Vision Transformer (ViT) and Its Applications
Why Does It Matter?
Human Visual Attention
Attention Is a Dot Product Between Two Features (see the sketch after this syllabus)
In Natural Language Processing
Image to Patches
Linear Projection - Patches to Features
Vision Transformer Is Invariant to the Position of Patches
Position Embedding
Learnable Class Embedding
Why Layer Norm?
Why Skip Connection?
Why Multi-Head Self-Attention?
A Transformer Encoder Is Made of L Encoder Modules Stacked Together
Variants Based on Layers, MLP Size, and MSA Heads
Pre-training on a Large Dataset, Fine-tuning on the Target Dataset
Training by Knowledge Distillation (DeiT)
Semantic Segmentation (mIoU: 50.3 SETR vs Baseline PSPNet on ADE20K)
Semantic Segmentation (mIoU: 84.4 SegFormer vs 82.2 SETR on Cityscapes)
Vision Transformer for Scene Text Recognition (ViTSTR)
Parameter, FLOPS, and Speed Efficient
Medical Image Segmentation (DSC: 77.5 TransUNet vs 71.3 R50-ViT Baseline)
Limitations
Recommended Open-Source Implementations of ViT
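For readers who want to see how the pieces in this syllabus fit together, here is a minimal, illustrative sketch in PyTorch. It is not code from the course: the names (MiniViT, dot_product_attention) and the toy sizes (32x32 images, 8x8 patches, 2 encoder layers, 4 heads) are our own assumptions, chosen only to show the standard ViT recipe the lectures cover: image to patches, linear projection, class token, position embedding, and a stack of L encoder modules.

```python
import torch
import torch.nn as nn

def dot_product_attention(q, k, v):
    # "Attention is a dot product between two features": similarity scores are
    # query-key dot products, scaled and softmaxed, then used to mix the values.
    scores = q @ k.transpose(-2, -1) / k.size(-1) ** 0.5
    return scores.softmax(dim=-1) @ v

class MiniViT(nn.Module):
    # Toy hyperparameters for illustration only; real ViT variants differ in
    # depth (layers), MLP size, and number of MSA heads.
    def __init__(self, image_size=32, patch_size=8, dim=64, depth=2,
                 heads=4, num_classes=10):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Image to patches + linear projection in one step:
        # a conv whose kernel and stride both equal the patch size.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size,
                                     stride=patch_size)
        # Learnable class embedding, prepended to the patch sequence.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Learnable position embedding: without it, the encoder is
        # invariant to the position of patches.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        # L encoder modules stacked together; each combines layer norm,
        # multi-head self-attention, an MLP, and skip connections.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        x = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)      # (B, 1, dim)
        x = torch.cat([cls, x], dim=1) + self.pos_embed     # add positions
        x = self.encoder(x)                                 # stacked MSA + MLP blocks
        return self.head(x[:, 0])                           # classify from class token

logits = MiniViT()(torch.randn(2, 3, 32, 32))
print(logits.shape)  # torch.Size([2, 10])
```

The knobs that distinguish the ViT variants in the lectures (Base, Large, Huge) are exactly the ones exposed here: encoder depth, MLP (feed-forward) size, and the number of MSA heads.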
Taught by
Open Data Science