Explore the "Common Corpus" project in this keynote address by Anastasia Stasenko, co-founder of pleias and Associate Senior Lecturer at Sorbonne-Nouvelle. The talk examines the challenges and opportunities in developing fully open-source, reproducible Large Language Models (LLMs). Discover how the project aims to overcome the training-data bottleneck by establishing the largest collection of fully open data for LLM training, comprising 1 trillion (1T) tokens. Learn about the legal issues surrounding copyrighted content in AI training, the importance of data quality for model performance, and how Common Corpus supports the open-source AI community. Gain insights into how this corpus was constructed and its potential impact on advancing openness in generative AI.
Keynote: Common Corpus: Opening Data for Building Open Source LLMs - Anastasia Stasenko
Taught by
Linux Foundation