Visual Features for Context-Aware Speech Recognition - 2016

Overview

Explore techniques for improving automatic speech recognition in challenging multimedia content in this lecture by Florian Metze of Carnegie Mellon University. Delve into methods for adapting acoustic and language models using visual context from video, such as detected objects and scenes. Learn about experiments on "how-to" videos in which incorporating visual information reduces word error rates. Examine approaches for handling speech variability, speaker-microphone distance, and audio-visual fusion. Gain insights into applications in robotics, human-computer interaction, and large-scale multimedia indexing. Discover how this research aims to bridge the gap between the video-to-text and speech-to-text communities.

Syllabus

Intro
Outline
Automatic Speech Recognition
Speech Variability (Spectral)
Decoding Procedure
Experimental Setup
Simple Extensions
Performance on Switchboard
IARPA "Aladdin" Project
Speaker Microphone Distance (SMD)
Training SMD Extractors
Training SMD descriptors
SMD Results
SMD Analysis
Audio-Visual ASR
Speaker Attributes
Speaker Actions
Semantic Indexing CNN Features
Fusion of Approaches
Analysis "indoor" vs "outdoor"
Summary

Taught by

Center for Language & Speech Processing (CLSP), JHU
