Want High Performing LLMs? It's All About Your Data - Importance of Quality Data in AI Development

Overview

Discover the critical role of high-quality data in developing high-performing large language models (LLMs) for production environments in this 34-minute talk from the LLMs in Production Conference. Explore the challenges of building LLMs that work effectively at scale, and learn why data quality is becoming the key differentiator in model performance. Delve into the importance of pre-training, common pitfalls to avoid, and strategies for ensuring data scientists work with high-quality data throughout the machine learning workflow. Gain insights into data-centric AI, fine-tuning techniques, and the significance of predictability in model outcomes. Examine modern ML trends, including open-source models, chain-of-thought prompting, and context retrieval, and understand the shift towards instruction-based approaches in AI development.

Syllabus

Intro
AI is mainstream
The Hype
The Transformer Paper
Data-Centric AI
Modern ML Wave
Open Source Models
Good Quality Data
Importance of Data
Fine-Tuning
Predictability
Data-Centric Development
Chain-of-Thought Prompt
Context Retrieval
Instruction