Multimodal AI: Understanding Large Language Models with Vision and Audio Capabilities
Shaw Talebi via YouTube
Overview
Learn about three distinct approaches to extending Large Language Models (LLMs) beyond text-only capabilities in this technical video presentation. Explore the evolution of multimodal AI systems, from the integration of external tools, to adapter-based architectures, to fully unified models. See a practical, hands-on demonstration using LLaMA 3.2 for vision tasks with Ollama, grounded in academic research and real-world applications. Gain insights into the future trajectory of multimodal AI development, with references to key research papers and technical resources for further exploration.
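For readers who want to try the demonstration themselves, the following is a minimal sketch of how a vision query might look using the ollama Python client. It assumes the ollama package is installed (pip install ollama) and that the llama3.2-vision model has already been pulled locally (ollama pull llama3.2-vision); the image path and prompt are illustrative placeholders, not materials from the video.

    # Minimal sketch: querying LLaMA 3.2 Vision through Ollama's Python client.
    import ollama

    response = ollama.chat(
        model="llama3.2-vision",
        messages=[
            {
                "role": "user",
                "content": "Describe what is in this image.",
                "images": ["photo.jpg"],  # placeholder path to a local image file
            }
        ],
    )

    # The model's reply is returned in the message content.
    print(response["message"]["content"])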
Syllabus
Introduction
Multimodal LLMs
Path 1: LLM + Tools
Path 2: LLM + Adapters
Path 3: Unified Models
Example: LLaMA 3.2 for Vision Tasks with Ollama
What's next?
Taught by
Shaw Talebi