Preparing Fineweb - A Finely Cleaned Common Crawl Dataset for Language Models

Trelis Research via YouTube

Deduplication strategies

8 of 12

Classroom Contents

  1. Common Crawl Data Processing Pipeline
  2. Video Overview
  3. Common Crawl Raw Dataset
  4. Common Crawl improves over time?
  5. Dataset Comparisons: C4, RefinedWeb, Fineweb, Llama 3, Phi-3
  6. Data Processing Pipeline and Datatrove
  7. Quality filters: Gopher, C4, Fineweb
  8. Deduplication strategies
  9. Fineweb edu: LLM assisted dataset filtering
  10. Training a classifier for dataset filtering
  11. My recommendation: Fineweb Edu Latest Crawl
  12. Why is Llama 3 better than Llama 2?
