Common Corpus - Opening Data for Building Open Source LLMs

Linux Foundation via YouTube

Overview

Explore the groundbreaking "Common Corpus" project in this keynote address by Anastasia Stasenko, Co-founder of pleias and Associate Senior Lecturer at Sorbonne-Nouvelle. Delve into the challenges and opportunities of developing fully open-source, reproducible Large Language Models (LLMs). Discover how the project aims to overcome the training-data bottleneck by establishing the largest collection of fully open data for LLM training, comprising 1 trillion (1T) tokens. Learn about the legal issues surrounding the use of copyrighted content in AI training, the importance of data quality for model performance, and how the Common Corpus project is empowering the open-source AI community. Gain insights into how this massive corpus was constructed and its potential impact on advancing openness in generative AI.

Syllabus

Keynote: Common Corpus: Opening Data for Building Open Source LLMs - Anastasia Stasenko

Taught by

Linux Foundation
