The Data Addition Dilemma: Navigating Distribution Shifts in Machine Learning

Overview

Watch a 37-minute lecture from UC Berkeley researcher Irene Y. Chen at the Simons Institute exploring why combining data from different sources for machine learning training isn't always beneficial. Learn about the "Data Addition Dilemma," in which mixing dissimilar data sources can reduce accuracy, create fairness issues, and harm performance for underrepresented groups. Examine the fundamental trade-off between the benefits of increased data scale and the drawbacks of distribution shift when combining datasets. Discover practical strategies and heuristics for deciding which data sources to combine to get the largest gains in model performance. Gain insights into key considerations for data collection and composition as AI models continue to grow in size and complexity.