Overview
Explore the development and capabilities of FLAVA, a unified vision-and-language model, in this 21-minute conference talk by Amanpreet Singh, Research Lead at Hugging Face. Follow the journey toward a holistic universal model that performs well on vision tasks, language tasks, and cross- and multi-modal vision-and-language tasks, and learn about FLAVA's strong performance across 35 diverse tasks spanning multiple modalities. Trace the evolution from domain-specific transformer models to UniT (Unified Transformer), and see how FLAVA takes that unification one step further. Gain insights into the model's architecture, functionality, and evaluation process. This presentation, recorded at Snorkel AI's 2023 Foundation Model Virtual Summit, offers valuable background for anyone interested in state-of-the-art vision-language pretraining and foundation models in artificial intelligence.
Syllabus
Intro
How do we build foundation models?
Successes of transformers in (specific) domains
UniT: Unified Transformer across domains
Can we take it one step further?
How does FLAVA work?
Stepping up the evaluation
Taught by
Snorkel AI