Unsupervised Learning of Spoken Language with Visual Context

MITCBMM via YouTube

Overview

This 34-minute talk by Jim Glass of MIT presents research on unsupervised learning of spoken language using visual context. It opens with the challenges facing automatic speech recognition and introduces audio-visual embedding spaces as a way to learn language from unannotated speech. The talk shows how deep learning models can associate images with spoken descriptions and discover word-like units in the audio. It then covers the experimental evaluation and analysis, which demonstrate the model's ability to cluster visual objects with their spoken counterparts. Topics include crowdsourcing audio-visual data, evaluation through image search and image annotation, time-varying audio-visual affiliation, audio-visual grounding for localization, and the spatial distribution of speech clusters. The talk closes with the broader implications for extending speech recognition to more of the world's languages.

Syllabus

Intro
Challenge for Automatic Speech Recognition
A Perspective on Spoken Language Processing
Crossing the Vision Language Boundary
Learning an Audio/Visual Embedding Space?
Joint Audio-Visual Analysis Architecture
Crowdsourcing Audio-Visual Data
Evaluation: Image Search and Annotation
Evaluating via Image Search
Evaluating via Image Annotation
Time-varying Audio-Visual Affiliation
Audio-Visual Grounding for Localization
Examples of Audio-Visual Clusters
Cluster Analysis
Spatial Distribution of Speech Clusters
Final Message

Taught by

MITCBMM
