Visual Features for Context-Aware Speech Recognition - 2016

Center for Language & Speech Processing (CLSP), JHU via YouTube

Overview

In this lecture, Florian Metze of Carnegie Mellon University presents techniques for improving automatic speech recognition (ASR) on challenging multimedia content. The talk covers methods for adapting acoustic and language models using visual context from video, such as detected objects and scenes, and reports experiments on "how-to" videos in which incorporating visual information reduces word error rates. It also examines approaches for handling speech variability, speaker-microphone distance, and audio-visual fusion. Applications discussed include robotics, human-computer interaction, and large-scale multimedia indexing. The research aims to bridge the gap between the video-to-text and speech-to-text communities.
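For readers unfamiliar with the word error rate metric mentioned above, the following is a minimal sketch of how it is commonly computed: the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a recognizer's hypothesis, divided by the number of reference words. The function name and the example sentences are illustrative assumptions and are not taken from the lecture.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative helper only; whitespace tokenization is an assumption.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up transcripts for illustration only.
    reference = "add the flour to the bowl"
    hypothesis = "add flour to the bowl"
    print(f"WER: {word_error_rate(reference, hypothesis):.3f}")
```

A lower value means the hypothesis is closer to the reference. The value can exceed 1.0 when the hypothesis contains many inserted words.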

Syllabus

Intro
Outline
Automatic Speech Recognition
Speech Variability (Spectral)
Decoding Procedure
Experimental Setup
Simple Extensions
Performance on Switchboard
IARPA "Aladdin" Project
Speaker Microphone Distance (SMD)
Training SMD Extractors
Training SMD Descriptors
SMD Results
SMD Analysis
Audio-Visual ASR
Speaker Attributes
Speaker Actions
Semantic Indexing CNN Features
Fusion of Approaches
Analysis "indoor" vs "outdoor"
Summary

Taught by

Center for Language & Speech Processing (CLSP), JHU
