Explore data leakage and reproducibility in machine learning-based science in this 48-minute talk. Follow an investigation of reproducibility failures across 17 scientific fields, in which leakage affected 329 papers and led to overly optimistic conclusions. Examine a taxonomy of 8 types of leakage, ranging from basic errors to open research problems. Learn about proposed methodological changes, including model info sheets, that aim to catch leakage before publication. See the results of a reproducibility study in civil war prediction, which found that once data leakage is corrected, complex ML models do not outperform older statistical methods. The speaker is Sayash Kapoor, a Ph.D. candidate at Princeton University, whose research on ML methods in science has been featured in prominent media outlets.