Overview
Explore a 37-minute technical presentation on LLaVA Chain of Thought (CoT), covering data preparation, training methodology, and inference-time scaling for Vision Language Models (VLMs). Learn why reasoning capabilities matter in VLMs, how synthetic data is generated, and how the resulting datasets are created and used. Gain insight into inference-time scaling techniques and model training approaches, with practical demonstrations and access to the relevant datasets through the Image-CoT-1m repository. Follow along with detailed explanations of the dataset-generation methods, complete with real-world examples and implementation strategies for building more capable vision-language models.
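The "inference-time scaling" mentioned above is often implemented as best-of-N sampling: draw several candidate chain-of-thought answers from the model and keep the one a scoring function prefers. The presentation's exact method is not specified here, so the sketch below is a generic, hypothetical illustration; `sample_fn` and `score_fn` are stand-ins for a real VLM sampler and reward/verifier model.

```python
import random

def best_of_n(question, sample_fn, score_fn, n=8):
    """Inference-time scaling via best-of-N: sample n candidate
    chain-of-thought answers and return the highest-scoring one."""
    candidates = [sample_fn(question) for _ in range(n)]
    return max(candidates, key=score_fn)

# Toy stand-ins (hypothetical): a real pipeline would sample from the
# VLM with temperature > 0 and score with a reward model or a
# self-consistency check across the candidates.
rng = random.Random(42)

def sample_fn(question):
    return {"answer": rng.choice(["cat", "dog", "bird"]),
            "confidence": rng.random()}

def score_fn(candidate):
    return candidate["confidence"]

best = best_of_n("What animal is in the image?", sample_fn, score_fn, n=16)
print(best["answer"])
```

Spending more compute at inference (larger n) buys more chances to land on a good reasoning path, which is the core trade-off inference-time scaling exploits.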
Syllabus
Intro
Overview of VLMs
Why VLMs Need Reasoning
LLaVA Chain of Thought
Synthetic Data Generation
Generating Datasets
Where to find the Datasets
How we Generated the Synthetic Data
Questions
What is Inference-Time Scaling?
Model Training
Taught by
Oxen