Demystifying Data Curation for Pretrained Language Models

Overview

Explore a guest lecture by Kylo Lo, hosted by the University of Utah Data Science department, on how data is curated for pretrained language models. Learn the methods and best practices for preparing and organizing datasets used to train large language models, along with the considerations, challenges, and solutions in data curation that directly affect model performance and reliability. Discover practical approaches to data selection, cleaning, and preprocessing in this 47-minute presentation, which opens with a brief introduction before moving into the core technical content.