Overview
Explore a detailed 14-minute video analysis of Stability AI's Stable Video Diffusion model, covering the architecture, training procedure, and results from the research paper. Learn about the three-stage training process designed for video generation models, which produces videos of 14 or 25 frames at customizable frame rates between 3 and 30 frames per second. Delve into key components including image pretraining, the video curation stage, development of the LVD dataset, filtering mechanisms, optical flow, synthetic caption generation, and OCR detection. Understand the role of ablation studies and high-quality fine-tuning, and see practical text-to-video and image-to-video examples that demonstrate how this foundation model outperforms leading closed models from competitors such as Runway and Pika Labs.
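As one concrete illustration of the curation ideas covered in the video, below is a minimal sketch of optical-flow-based motion filtering: it scores a clip by its average dense optical flow magnitude so that near-static clips can be dropped from a training set. It uses OpenCV and NumPy; the frame stride, threshold, and file name are illustrative assumptions, not the settings used in the actual Stable Video Diffusion pipeline.

```python
import cv2
import numpy as np

def mean_flow_magnitude(video_path: str, stride: int = 5) -> float:
    """Average dense optical flow magnitude over subsampled frame pairs."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    if not ok:
        cap.release()
        return 0.0
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    magnitudes = []
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame_idx += 1
        if frame_idx % stride:
            continue  # subsample frames to keep the check cheap
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Farneback dense optical flow between the last scored frame and this one
        flow = cv2.calcOpticalFlowFarneback(
            prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        mag, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        magnitudes.append(float(mag.mean()))
        prev_gray = gray
    cap.release()
    return float(np.mean(magnitudes)) if magnitudes else 0.0

# Keep the clip only if it shows enough motion (threshold chosen arbitrarily here).
if mean_flow_magnitude("clip.mp4") > 1.0:
    print("keep clip")
else:
    print("drop near-static clip")
```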
Syllabus
- Intro
- Model Architecture
- Training Stages
- Image Pretraining Stage
- Motivation for Image Pretraining
- Video Curation Stage
- Video data curation pipeline
- LVD Dataset
- Filtering Mechanisms
- Optical Flow
- Synthetic Captions
- OCR Detection
- LVD dataset summarised
- Ablation studies
- High quality fine-tuning
- Base Model
- Text-to-video example
- Image-to-video example
- Conclusion
Taught by
AI Bites