LLaVA 1.5 - Understanding Multi-modal Vision Language Models

Discover AI via YouTube

Overview

Learn about LLaVA 1.5, a multi-modal vision-language model, in this 44-minute technical video covering its architecture, capabilities, and place in the AI landscape. The video walks through the system's core components, a pre-trained vision encoder and a large language model, and explains how visual features are projected into the language model's embedding space so they align with its semantic embeddings. It covers the training data, roughly 160,000 unique language-image instruction-following samples generated with GPT-4, and the two-stage training process: the first stage trains only the projection matrix, and the second fine-tunes both the projection and the language model. It also examines the architectural changes from LLaVA 1.0 to 1.5, including the move from a single linear projection to a two-layer MLP for feature transformation and the integration of the Vicuna 13B large language model. Finally, it highlights the system's modular design, which allows the LLM and the Vision Transformer to be exchanged independently, making it useful for research and for practical applications such as scientific queries and complex reasoning tasks.
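
The projection step described above can be illustrated with a small sketch. It is not LLaVA's actual code: the dimensions are toy values, the weights are random, and the function names (`make_linear`, `project_patches`, `build_llm_input`) are hypothetical. It only shows the idea of passing each image-patch feature through a small MLP (linear, GELU, linear) so that it has the same width as the language model's token embeddings, then joining the resulting visual tokens with the text tokens. Placing the visual tokens before the text is a simplification made here.

```python
import math
import random


def gelu(x):
    # Tanh approximation of the GELU activation.
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def make_linear(in_dim, out_dim, rng):
    # Randomly initialised weights and zero bias (illustrative only, not trained).
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def linear(vec, layer):
    # Apply y = W x + b to a single feature vector.
    weight, bias = layer
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]


def project_patches(patch_features, fc1, fc2):
    # Two-layer MLP projector: linear -> GELU -> linear, applied per patch.
    return [linear([gelu(h) for h in linear(p, fc1)], fc2) for p in patch_features]


def build_llm_input(visual_tokens, text_embeddings):
    # Assumption: visual tokens are simply placed before the text tokens.
    return visual_tokens + text_embeddings


rng = random.Random(0)
VISION_DIM, LLM_DIM = 8, 16           # toy sizes, far smaller than real models
NUM_PATCHES, NUM_TEXT_TOKENS = 4, 3   # toy sequence lengths

fc1 = make_linear(VISION_DIM, LLM_DIM, rng)
fc2 = make_linear(LLM_DIM, LLM_DIM, rng)

# Stand-ins for the vision encoder output and the LLM's text embeddings.
patch_features = [[rng.gauss(0, 1) for _ in range(VISION_DIM)] for _ in range(NUM_PATCHES)]
text_embeddings = [[rng.gauss(0, 1) for _ in range(LLM_DIM)] for _ in range(NUM_TEXT_TOKENS)]

visual_tokens = project_patches(patch_features, fc1, fc2)
llm_input = build_llm_input(visual_tokens, text_embeddings)

print(len(llm_input), len(llm_input[0]))  # prints "7 16"
```

Because the projector is the only piece that connects the two models, this layout also suggests why the design is modular: swapping the vision encoder or the LLM mainly changes `VISION_DIM` or `LLM_DIM` and requires retraining the projector.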

Syllabus

New LLaVA AI explained: GPT-4 VISION's Little Brother

Taught by

Discover AI
