The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing and Attribution in AI

Overview

Explore the Data Provenance Initiative, a groundbreaking effort to audit and trace more than 1,800 text datasets used in AI training. Learn about the legal and ethical concerns surrounding dataset licensing and attribution in the AI industry. Discover the tools and standards developed to trace dataset lineage, from sources and creators to license conditions and subsequent use. Examine a landscape analysis that reveals stark differences between commercially open and closed datasets in their composition and focus areas. Gain insights from speakers Anthony Chen, an engineer at Google DeepMind, and Shayne Longpre, a PhD candidate at MIT, as they present their findings and discuss the implications for data transparency and understanding in AI development. Delve into the challenges of dataset monopolization in areas such as low-resource languages, creative tasks, and synthetic training data.