Overview
Explore the architecture and training of multimodal Large Language Models (MLLMs) in this 24-minute technical video lecture. Dive into Vision Transformers and how they are trained with contrastive learning methods such as OpenAI's CLIP and Google's SigLIP. Learn how Vision Transformers serve as the visual encoders that give MLLMs their image-processing capabilities. Get hands-on experience by examining Google's open-weight PaliGemma implementation, seeing these concepts applied in a real-world multimodal LLM. Access supplementary materials, including a comprehensive Canvas PDF on LLM limitations and challenges plus a Colab notebook for practical experimentation. The lecture covers MLLM architecture, training approaches, Vision Transformer fundamentals, and contrastive learning principles, then concludes with a practical lab session using PaliGemma, all supported by academic references and timestamps for easy navigation.
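To make the contrastive objective concrete, here is an illustrative CLIP-style loss in PyTorch. This is a minimal sketch, not code from the lecture or its Colab notebook; the tensor shapes and temperature value are assumptions for demonstration. Given a batch of paired image and caption embeddings, the loss pulls each matching pair together while pushing all mismatched pairs in the batch apart.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # L2-normalize so the dot product equals cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix: logits[i, j] = sim(image i, caption j).
    logits = image_emb @ text_emb.t() / temperature

    # The matching image-caption pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: pick the right caption for each image,
    # and the right image for each caption.
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_t) / 2

# Toy usage with random embeddings standing in for encoder outputs.
images = torch.randn(8, 512)  # batch of 8 image embeddings
texts = torch.randn(8, 512)   # their 8 paired caption embeddings
print(clip_contrastive_loss(images, texts))
```

SigLIP, also covered in the lecture, replaces this batch-wise softmax with an independent sigmoid loss per image-text pair, which removes the need for a global normalization over the batch and scales better to large batch sizes.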
Syllabus
- MLLM Architecture
- Training MLLMs
- Vision Transformer
- Contrastive Learning: CLIP, SigLIP
- Lab: PaliGemma (see the quick-start sketch after this list)
- Summary
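For a sense of what the PaliGemma lab involves, here is a hypothetical quick-start using the Hugging Face transformers API. This is a sketch under stated assumptions, not the lecture's actual Colab code: it assumes transformers >= 4.41 and access to the gated google/paligemma-3b-pt-224 checkpoint, and the actual notebook may load and prompt the model differently.

```python
# Requires accepting the PaliGemma license on Hugging Face and logging in
# with `huggingface-cli login` before the checkpoint can be downloaded.
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-pt-224"
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)

# Placeholder image; in the lab you would load a real photo instead.
image = Image.new("RGB", (224, 224), color="gray")

# "caption en" is one of PaliGemma's task prompts (English captioning).
inputs = processor(text="caption en", images=image, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output[0], skip_special_tokens=True))
```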
Taught by
Donato Capitella