How Vision Language Models Reason - Understanding LLaVa Chain of Thought

Oxen via YouTube

Overview

Explore a 37-minute technical presentation that delves into the inner workings of LLaVa Chain of Thought (CoT), focusing on data preparation, training methodologies, and inference-time scaling for Vision Language Models (VLMs). Learn about the importance of reasoning capabilities in VLMs, discover the process of synthetic data generation, and understand how datasets are created and utilized. Gain insights into inference-time scaling techniques and model training approaches, with practical demonstrations and access to relevant datasets through the Image-CoT-1m repository. Follow along with detailed explanations of dataset generation methods, complete with real-world examples and implementation strategies for building more capable vision-language models.
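To make the ideas concrete, here is a minimal sketch (not from the video) of chain-of-thought prompting with a LLaVA-style vision language model, plus a simple best-of-N sampling loop as one common illustration of inference-time scaling. The public llava-hf/llava-1.5-7b-hf checkpoint and its USER/ASSISTANT prompt format are real Hugging Face conventions; the image path, the question, and best-of-N as the scaling strategy are assumptions for illustration, not the presenter's exact pipeline.

```python
# Minimal sketch: CoT prompting a LLaVA-style VLM, with naive best-of-N
# sampling as a stand-in for inference-time scaling. Assumes the public
# llava-hf checkpoint; chart.png and the question are hypothetical.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "llava-hf/llava-1.5-7b-hf"  # public LLaVA 1.5 checkpoint on the Hub

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("chart.png")  # hypothetical input image
question = "How many bars in the chart are above 50?"

# Chain-of-thought style prompt: ask the model to reason step by step
# before committing to a final answer.
prompt = (
    "USER: <image>\n"
    f"{question} Think step by step, then state the final answer.\n"
    "ASSISTANT:"
)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)

# Inference-time scaling, in its simplest form: sample N candidate
# reasoning traces instead of one, then pick among them. Here we just
# collect the candidates; a real system would score them.
candidates = []
for _ in range(4):
    out = model.generate(
        **inputs, max_new_tokens=256, do_sample=True, temperature=0.7
    )
    candidates.append(processor.decode(out[0], skip_special_tokens=True))

print(candidates[0])
```

In practice the candidate traces would be ranked by a verifier model or combined by majority voting over the final answers; spending more compute at inference this way, rather than on a larger model, is the core idea behind the inference-time scaling techniques the talk covers.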

Syllabus

Intro
Overview of VLMs
Why VLMs Need Reasoning
LLaVa Chain of Thought
Synthetic Data Generation
Generating Datasets
Where to find the Datasets
How we Generated the Synthetic Data
Questions
What is Inference-Time Scaling?
Model Training

Taught by

Oxen
