Classroom Contents
Preparing Fineweb - A Finely Cleaned Common Crawl Dataset for Language Models
- 1 Common Crawl Data Processing Pipeline
- 2 Video Overview
- 3 Common Crawl Raw Dataset
- 4 Common Crawl improves over time?
- 5 Dataset Comparisons: C4, RefinedWeb, Fineweb, Llama 3, Phi-3
- 6 Data Processing Pipeline and Datatrove
- 7 Quality filters: Gopher, C4, Fineweb
- 8 Deduplication strategies
- 9 Fineweb edu: LLM assisted dataset filtering
- 10 Training a classifier for dataset filtering
- 11 My recommendation: Fineweb Edu Latest Crawl
- 12 Why is Llama 3 better than Llama 2?