
Multi-Modal LLMs for Image, Sound and Video - Episode 6.3

Donato Capitella via YouTube

Overview

Explore the architecture and training of multi-modal Large Language Models (MLLMs) in this 24-minute technical video lecture. Dive into Vision Transformers and how they are trained with contrastive learning techniques such as OpenAI's CLIP and Google's SigLIP, then see how a Vision Transformer serves as the visual backbone that gives an MLLM its image-processing capability. Get hands-on by examining Google's open-weight PaliGemma implementation, where these concepts come together in a real-world multi-modal LLM. Supplementary materials include a comprehensive Canvas PDF on LLM limitations and challenges, plus a Colab notebook for practical experimentation. The lecture systematically covers MLLM architecture, training approaches, Vision Transformer fundamentals, and contrastive learning principles, and concludes with a practical lab session using PaliGemma, all supported by academic references and timestamps for easy navigation.
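For a concrete taste of the contrastive approach covered in the lecture, here is a minimal sketch (not from the lecture's materials) that scores one image against candidate captions with OpenAI's CLIP via the Hugging Face transformers library; the image path and captions are illustrative assumptions.

```python
# Minimal sketch (not the lecture's code): scoring image-text pairs with CLIP.
# Assumes `pip install transformers torch pillow` and a local example image.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # assumed local image
captions = ["a photo of a cat", "a photo of a dog"]  # illustrative captions

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them
# into probabilities over the candidate captions.
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(captions, probs[0].tolist())))
```

Because CLIP was trained contrastively to pull matching image-text pairs together and push mismatched pairs apart, the caption that actually describes the image should receive most of the probability mass.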

Syllabus

- MLLM Architecture
- Training MLLMs
- Vision Transformer
- Contrastive Learning: CLIP, SigLIP
- Lab: PaliGemma (see the sketch after this syllabus)
- Summary
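To complement the lab, the following is a minimal sketch, not the lecture's Colab notebook: it loads an open-weight PaliGemma checkpoint via Hugging Face transformers and captions a single image. The checkpoint name and image path are assumptions, and the weights are gated, so the model license must be accepted on Hugging Face first.

```python
# Minimal sketch (not the lecture's Colab): captioning one image with
# Google's open-weight PaliGemma via Hugging Face transformers.
# The checkpoint is gated; accept its license on Hugging Face before use.
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-224"  # assumed checkpoint name
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")  # assumed local image
prompt = "caption en"  # PaliGemma uses short task prefixes as prompts

inputs = processor(text=prompt, images=image, return_tensors="pt")
generated = model.generate(**inputs, max_new_tokens=30)

# Strip the prompt tokens and decode only the newly generated caption.
prompt_len = inputs["input_ids"].shape[1]
print(processor.decode(generated[0][prompt_len:], skip_special_tokens=True))
```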

Taught by

Donato Capitella
