Overview
Syllabus
Welcome
Fine-web
Clustering and balancing data - Meta Paper
Clustering analysis in Colab
How to prepare chat / Q&A datasets synthetically
Q&A
Handling labeled data for fine-tuning
Setting a chat template for a tokenizer without one
Considerations on novel data and hallucinations
Issues with tokenizer and chat template not aligning
Using mixed-language datasets and their impact on training
Recommendations for models suitable for text classification
Extracting structured data from PDFs and tables
Multi-GPU training considerations
Using the LLM to VEC method for embeddings
Rag pipeline suggestions
Taught by
Trelis Research