Overview
Explore a 38-minute conference talk that delves into Data Prep Kit (DPK), an open-source cloud-native toolkit designed for scalable data preparation in Generative AI applications. Learn how data preparation has become a critical phase in the GenAI lifecycle, particularly for developing high-performing Large Language Models (LLMs). Discover the essential processes of cleaning, filtering, and transforming text and code datasets from various sources into tokenized forms suitable for LLM training. Understand the unique challenges in data preparation for different use cases, whether for pre-training, fine-tuning, or instruction tuning. See how IBM's open-source initiative aims to foster community collaboration and reduce development time through DPK, which has been used to prepare data for IBM's Granite models. Gain insights from IBM experts Daiki Tsuzuku and Takuya Goto on leveraging this toolkit to enhance your GenAI development workflow.
Syllabus
Data Prep Kit: A Comprehensive Cloud-Native Toolkit for Scalable Da... - Daiki Tsuzuku & Takuya Goto
Taught by
Linux Foundation