
YouTube

Understanding Task Vectors in Vision-Language Models - Cross-Modal Representations

Discover AI via YouTube

Overview

Explore research from UC Berkeley examining how vision-language models (VLMs) develop and employ "task vectors," internal representations that enable cross-modal task performance. Dive into the finding that these latent activations capture the essence of a task in a shared space across text and image modalities, allowing models to apply a task demonstrated in one format to queries posed in another. Learn about the three-phase query processing pipeline in which tokens evolve from raw inputs to task-specific representations and finally to answer-aligned vectors. Understand how combining instruction-based and example-based task vectors creates more efficient task representations for handling complex scenarios with limited data. Examine experimental evidence showing that text-based instruction vectors can guide image queries, yielding better performance than traditional unimodal approaches. Discover the implications of this research for developing more adaptable, context-aware AI systems that use unified task embeddings for cross-modal inference.
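To make the cross-modal patching idea concrete, below is a minimal sketch of extracting a task vector from text-only examples and patching it into an image query. It assumes a decoder-style VLM loaded with PyTorch; the attribute path model.language_model.layers and the choice of the last-token position are assumptions for illustration, not the exact setup used in the research.

```python
# Sketch of cross-modal task-vector patching (assumed layer naming and positions).
import torch

def capture_task_vector(model, text_inputs, layer_idx):
    """Run text-only instructions/examples and record the final token's hidden
    state at one intermediate layer -- the candidate task vector."""
    captured = {}

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        captured["vec"] = hidden[:, -1, :].detach()  # last-token activation

    layer = model.language_model.layers[layer_idx]  # assumed attribute path
    handle = layer.register_forward_hook(hook)
    with torch.no_grad():
        model(**text_inputs)
    handle.remove()
    return captured["vec"]

def patch_task_vector(model, image_inputs, layer_idx, task_vec):
    """Run an image query while overwriting the same layer's last-token
    activation with the text-derived task vector, steering the answer."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden.clone()
        hidden[:, -1, :] = task_vec.to(hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    layer = model.language_model.layers[layer_idx]
    handle = layer.register_forward_hook(hook)
    with torch.no_grad():
        out = model(**image_inputs)
    handle.remove()
    return out
```

In this sketch, the same intermediate-layer activation serves as a shared task representation: it is read out from a text prompt and written back into an image-query forward pass, which is the core of the cross-modal transfer described above.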

Syllabus

Inside the VLM: NEW "Task Vectors" emerge (UC Berkeley)

Taught by

Discover AI

