Stanford Seminar - Audio Research: Transformers for Applications in Audio, Speech and Music

Overview

Explore the cutting-edge applications of Transformers in audio, speech, and music research in this Stanford seminar. Delve into the Transformer revolution and its impact on audio processing, from language modeling to understanding and synthesis. Learn about spectrograms, raw audio synthesis techniques, and the challenges of classical FM synthesis. Compare WaveNet and Transformer models for next sample prediction, and discover a framework for generative and contrastive learning of audio representations. Investigate acoustic scene understanding and the combination of Vector Quantization with auto-encoders and Transformers. Examine the potential of audio Transformers for large-scale audio understanding, and explore the use of wavelets on Transformer embeddings. Gain insights into the methodology, results, and what the front-end learns in this comprehensive overview of audio research using Transformer architectures.

Syllabus

Introduction.
Transformers for Music and Audio: Language Modelling to Understanding to Synthesis.
The Transformer Revolution.
Models getting bigger...
What are spectrograms.
Raw Audio Synthesis: Difficulty; Classical FM Synthesis; Karplus-Strong.
Baseline: Classic WaveNet.
Improving Transformer Baseline: Major bottleneck of Transformers.
Results & Unconditioned Setup. Evaluation criterion: comparing WaveNet and Transformers on next-sample prediction, with top-5 accuracy out of 256 possible states as the error metric. Why this setup? (1) Application agnostic. (2) Suits the training setup.
A Framework for Generative and Contrastive Learning of Audio Representations.
Acoustic Scene Understanding.
Recipe of doing.
Turbocharging best of two worlds: Vector Quantization, a powerful and under-utilized algorithm; combining VQ with auto-encoders and Transformers.
Turbocharging best of two worlds: Learning clusters from vector quantization; using long-term dependency learning with that cluster-based representation for the Markovian assumption; the better we become at prediction, the better the summarization is.
Audio Transformers: Transformer Architectures for Large Scale Audio Understanding - Adieu Convolutions (Stanford University, March 2021).
Wavelets on Transformer Embeddings.
Methodology + Results.
What does it learn -- the front end.
Final Thoughts.
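
The evaluation criterion named in the syllabus (top-5 accuracy on next-sample prediction over 256 possible states) can be illustrated with a minimal sketch. This is not the seminar's code: the function name `top_k_accuracy`, the use of NumPy, and the random stand-in data are all assumptions made for illustration. A prediction counts as correct when the true next-sample state is among the five highest-scoring states.

```python
import numpy as np

def top_k_accuracy(logits, targets, k=5):
    """Fraction of time steps whose true state is among the k highest-scoring states.

    logits:  array of shape (num_steps, num_states), one score per state per step
    targets: integer array of shape (num_steps,), the true state at each step
    (hypothetical helper, not taken from the seminar)
    """
    logits = np.asarray(logits)
    targets = np.asarray(targets)
    # Indices of the k largest scores in each row; order within the top k is irrelevant.
    top_k = np.argpartition(logits, -k, axis=1)[:, -k:]
    # A step is a hit if its target index appears anywhere in its top-k set.
    hits = (top_k == targets[:, None]).any(axis=1)
    return float(hits.mean())

# Stand-in data: random scores over 256 states (assumed to be quantized sample values).
rng = np.random.default_rng(0)
num_steps, num_states = 1000, 256
logits = rng.normal(size=(num_steps, num_states))
targets = rng.integers(0, num_states, size=num_steps)
print(top_k_accuracy(logits, targets, k=5))
```

With random scores, the result should sit near the chance level of 5/256, which gives a rough floor against which a trained WaveNet or Transformer would be compared.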