Reliable Data for Large ML Models: Principles and Practices

Overview

Explore how to keep data reliable for large machine learning models in this conference talk from SREcon23 Europe/Middle East/Africa. Delve into the challenges posed by the growing scale and complexity of ML training data and model output, particularly for Large Language Models (LLMs). Learn how foundational Site Reliability Engineering (SRE) principles apply to these challenges: managing tradeoffs between flexibility and stability, accounting for human operations within systems, and defining clear reliability requirements. Discover common data reliability challenges in ML and how they manifest differently in LLMs than in traditional supervised ML systems. Gain insights into best practices for assessing and managing ML data risks in production systems, drawing on real-world experience and established SRE principles.