DeBERTa - Decoding-Enhanced BERT with Disentangled Attention

Overview

Explore a video explanation of the DeBERTa (Decoding-enhanced BERT with Disentangled Attention) paper. DeBERTa is the next iteration of BERT-style self-attention Transformer models and outperforms RoBERTa on multiple NLP tasks. Learn about its key improvements: the disentangled attention mechanism, which handles content and position information separately, and the use of relative position encodings. Examine the model's architecture, including the enhanced mask decoder, along with its pretraining efficiency and performance on downstream tasks. Follow along as the video works through the paper's abstract and authors, presents the experimental results, and discusses scaling up to 1.5 billion parameters and the model's impact on the SuperGLUE benchmark.

Syllabus

- Intro & Overview
- Position Encodings in Transformer's Attention Mechanism
- Disentangling Content & Position Information in Attention
- Disentangled Query & Key construction in the Attention Formula
- Efficient Relative Position Encodings
- Enhanced Mask Decoder using Absolute Position Encodings
- My Criticism of EMD
- Experimental Results
- Scaling up to 1.5 Billion Parameters
- Conclusion & Comments
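
The disentangled query and key construction named in the syllabus can be illustrated with a small NumPy sketch. It follows the formulation in the DeBERTa paper, where each attention score is the sum of a content-to-content, a content-to-position, and a position-to-content term, scaled by 1/sqrt(3d). The function names, argument names, and toy shapes below are hypothetical choices made for illustration. This is a single-head sketch under those assumptions, not the authors' implementation.

```python
import numpy as np

def relative_index(seq_len, k):
    """delta(i, j): relative distance i - j shifted by k and clipped to [0, 2k - 1]."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return np.clip(i - j + k, 0, 2 * k - 1)

def disentangled_scores(H, P, Wq_c, Wk_c, Wq_r, Wk_r, k):
    """Unnormalized single-head disentangled attention scores.

    H: (n, h) content vectors; P: (2k, h) relative position embeddings.
    All weight matrices are (h, d). Names are illustrative assumptions.
    """
    n, d = H.shape[0], Wq_c.shape[1]
    Qc, Kc = H @ Wq_c, H @ Wk_c      # content queries and keys, (n, d)
    Qr, Kr = P @ Wq_r, P @ Wk_r      # position queries and keys, (2k, d)
    delta = relative_index(n, k)     # delta[i, j] = delta(i, j)

    # Content-to-content: Qc_i . Kc_j
    c2c = Qc @ Kc.T
    # Content-to-position: Qc_i . Kr_{delta(i, j)}
    c2p = np.take_along_axis(Qc @ Kr.T, delta, axis=1)
    # Position-to-content: Kc_j . Qr_{delta(j, i)}; gather per key row, then transpose
    p2c = np.take_along_axis(Kc @ Qr.T, delta, axis=1).T

    return (c2c + c2p + p2c) / np.sqrt(3 * d)

# Toy usage with random inputs (shapes are arbitrary assumptions).
rng = np.random.default_rng(0)
n, h, d, k = 4, 5, 3, 2
H = rng.normal(size=(n, h))
P = rng.normal(size=(2 * k, h))
Wq_c, Wk_c, Wq_r, Wk_r = (rng.normal(size=(h, d)) for _ in range(4))
scores = disentangled_scores(H, P, Wq_c, Wk_c, Wq_r, Wk_r, k)
```

Computing the position terms against the 2k relative embeddings once and then gathering by index avoids materializing a separate position vector for every (i, j) pair, which is the idea behind the efficient relative position encodings item in the syllabus.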

Taught by

Yannic Kilcher
