Overview
Explore the intricacies of fine-tuning and deploying tiny text and vision models in this 44-minute tutorial. Dive into the architecture of multi-modal models, focusing on the Moondream model's components: its vision encoder (SigLIP), MLP vision projection, and language model (Phi). Learn how to apply LoRA adapters to multi-modal models and follow along with a hands-on fine-tuning notebook demo. Discover techniques for deploying custom APIs for multi-modal models, utilizing vLLM, and training models from scratch. Gain insights into multi-modal datasets and access a wealth of video resources to further your understanding of advanced vision and language processing techniques.
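As a rough illustration of the three-part design described above (vision encoder, MLP vision projection, language model), the sketch below composes the pieces into a single forward pass. The module names, dimensions, and projection shape are assumptions for illustration only, not Moondream's actual implementation.

```python
# Illustrative sketch only: module names and dimensions are assumptions,
# not Moondream's actual code.
import torch
import torch.nn as nn


class TinyVisionLanguageModel(nn.Module):
    def __init__(self, vision_encoder, language_model, vision_dim=1152, text_dim=2048):
        super().__init__()
        # Vision encoder (e.g. a SigLIP-style ViT) producing patch embeddings.
        self.vision_encoder = vision_encoder
        # MLP vision projection: maps image embeddings into the LM's hidden space.
        self.vision_projection = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )
        # Decoder-only language model (e.g. a Phi-style transformer).
        self.language_model = language_model

    def forward(self, pixel_values, input_ids):
        # 1) Encode the image into patch embeddings: (batch, num_patches, vision_dim).
        image_embeds = self.vision_encoder(pixel_values)
        # 2) Project them into the language model's embedding space.
        image_embeds = self.vision_projection(image_embeds)
        # 3) Prepend the projected image tokens to the text token embeddings and
        #    run the combined sequence through the language model (assumes a
        #    Hugging Face-style model that accepts inputs_embeds).
        text_embeds = self.language_model.get_input_embeddings()(input_ids)
        inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
        return self.language_model(inputs_embeds=inputs_embeds)
```

For the LoRA portion of the tutorial, a common approach (sketched below with the Hugging Face PEFT library, as an assumption about tooling rather than the video's exact code) is to attach adapters to the language model's attention projections while leaving the vision encoder frozen; the target module names vary by model.

```python
# Hedged example: target_modules depend on the specific language model.
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model.language_model = get_peft_model(model.language_model, lora_config)
```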
Syllabus
Fine-tuning tiny multi-modal models
Moondream server demo
Video Overview
Multi-modal model architecture
Moondream architecture
Moondream vision encoder SigLIP
Moondream MLP vision projection
Moondream language model Phi
Applying LoRA adapters to a multi-modal model
Fine-tuning notebook demo
Deploying a custom API for multi-modal models
vLLM
Training a multi-modal model from scratch
Multi-modal datasets
Video resources
Taught by
Trelis Research