Unsupervised Learning of Spoken Language with Visual Context

Overview

Explore research on unsupervised learning of spoken language using visual context in this 34-minute talk by Jim Glass from MIT. Delve into the challenges of automatic speech recognition and the potential of audio-visual embedding spaces for learning language from unannotated speech. Discover how deep learning models can associate images with spoken descriptions and derive word-like units from that speech. Examine the experimental evaluation and analysis showing that the model can cluster visual objects together with their spoken counterparts. Learn about crowdsourcing audio-visual data, evaluation through image search and image annotation, and time-varying audio-visual affiliation. Gain insights into audio-visual grounding for localization, the spatial distribution of speech clusters, and the broader implications for extending speech recognition to more of the world's languages.
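
The image search and image annotation evaluations mentioned above can be illustrated with a small retrieval sketch. It assumes that spoken captions and images have already been mapped into a shared embedding space, that similarity is a dot product, and that quality is reported as recall@k. The listing does not state the exact metric used in the talk, so the function names and toy vectors here are hypothetical.

```python
from typing import Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot-product similarity between two embedding vectors (an assumed choice)."""
    return sum(a * b for a, b in zip(u, v))


def recall_at_k(queries: Sequence[Sequence[float]],
                candidates: Sequence[Sequence[float]],
                k: int) -> float:
    """Fraction of queries whose paired candidate (same index) ranks in the top k."""
    hits = 0
    for i, query in enumerate(queries):
        scores = [dot(query, cand) for cand in candidates]
        # Rank candidate indices from most to least similar to the query.
        ranked = sorted(range(len(candidates)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(queries)


# Toy, made-up embeddings: audio[i] is the spoken caption paired with images[i].
audio = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
images = [[0.9, 0.1], [0.8, 0.6], [0.1, 0.9]]

# Image search: spoken caption as query, images as candidates.
search_r1 = recall_at_k(audio, images, k=1)
search_r2 = recall_at_k(audio, images, k=2)

# Image annotation: image as query, spoken captions as candidates.
annotation_r2 = recall_at_k(images, audio, k=2)
```

Swapping the two arguments switches between the search direction (caption to image) and the annotation direction (image to caption), so one function covers both evaluations under these assumptions.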

Syllabus

Intro
Challenge for Automatic Speech Recognition
A Perspective on Spoken Language Processing: Most of the world's languages have not been addressed by resource- and expert-intensive supervised methods
Crossing the Vision Language Boundary
Learning an Audio/Visual Embedding Space?
Joint Audio-Visual Analysis Architecture
Crowdsourcing Audio-Visual Data
Evaluation: Image Search and Annotation
Evaluating via Image Search
Evaluating via Image Annotation
Time-varying Audio-Visual Affiliation
Audio-Visual Grounding for Localization
Examples of Audio-Visual Clusters
Cluster Analysis
Spatial Distribution of Speech Clusters
Final Message