
YouTube

Multimodal AI: Understanding Large Language Models with Vision and Audio Capabilities

Shaw Talebi via YouTube

Overview

Learn about three distinct approaches to expanding Large Language Models (LLMs) beyond text-only capabilities in this technical video presentation. Explore the evolution of multimodal AI systems, starting with the integration of external tools, moving to adapter-based architectures, and culminating in unified models. Discover practical implementations through a hands-on demonstration using LLaMA 3.2 for vision tasks with Ollama, supported by extensive academic research and real-world applications. Gain insights into the future trajectory of multimodal AI development, with comprehensive references to key research papers and technical resources for further exploration.

Syllabus

Introduction
Multimodal LLMs
Path 1: LLM + Tools
Path 2: LLM + Adapters
Path 3: Unified Models
Example: LLaMA 3.2 for Vision Tasks with Ollama
What's next?
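The LLaMA 3.2 vision demo covered in the video might look roughly like the sketch below, which uses the Ollama Python client to send an image and a text prompt to a locally served vision model. The model tag `llama3.2-vision`, the prompt, and the file name `photo.jpg` are illustrative assumptions; this is a minimal sketch, not the presenter's exact code, and it assumes Ollama is installed and the model has been pulled (`ollama pull llama3.2-vision`).

```python
def build_vision_message(prompt, image_path):
    # Ollama's chat API accepts image file paths in the "images" field
    # alongside the text prompt in a single user message.
    return [{"role": "user", "content": prompt, "images": [image_path]}]


def describe_image(prompt, image_path, model="llama3.2-vision"):
    # Requires `pip install ollama` and a running Ollama server
    # (listening on localhost:11434 by default).
    import ollama

    response = ollama.chat(
        model=model,
        messages=build_vision_message(prompt, image_path),
    )
    return response["message"]["content"]


if __name__ == "__main__":
    # Hypothetical usage; "photo.jpg" is a placeholder image path.
    print(describe_image("What is in this image?", "photo.jpg"))
```

Passing the image directly to a vision-capable model like this corresponds to the adapter/unified-model paths discussed in the video, in contrast to Path 1, where a text-only LLM would instead call an external tool (e.g., a separate captioning model) to describe the image.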

Taught by

Shaw Talebi
