GPU Accelerated Data Curation for Large Language Models

Overview

Explore GPU-accelerated data curation for large language models in this 30-minute talk by Ryan Wolf, a Deep Learning Algorithm Engineer at NVIDIA. Learn why well-curated datasets matter for scaling LLMs and how to build high-quality datasets with NeMo Curator, an open-source library for GPU-accelerated data curation. See how curation can be scaled efficiently to datasets of trillions of tokens, a crucial yet often overlooked part of machine learning. Ryan draws on his background in AI systems and his current work developing NeMo Curator. This MLOps.community presentation, part of the DE4AI series, is useful for anyone who wants to deepen their understanding of data curation for foundation models.