Dated Data: Tracing Knowledge Cutoffs in Large Language Models
Center for Language & Speech Processing (CLSP), JHU via YouTube
Overview
Watch a 14-minute award-winning conference presentation from Johns Hopkins University's Center for Language & Speech Processing that explores the complexities of knowledge cutoff dates in Large Language Models (LLMs). Dive into the critical distinction between reported and effective cutoff dates for training data, and understand why this matters for applications requiring current information. Learn about a novel approach to estimate effective cutoffs at the resource level by probing across different data versions, without needing access to pre-training data. Discover key findings that reveal significant discrepancies between reported and effective cutoffs, attributed to temporal misalignments in CommonCrawl data and complications in LLM deduplication schemes. Gain valuable insights into why cutoff dates are more nuanced than previously thought, and understand the implications for both LLM dataset curators and practitioners implementing these models.
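The resource-level probing described above can be sketched as follows: gather several dated versions of the same resource (e.g. snapshots of a Wikipedia article), score each version with the model, and take the version the model finds most probable (lowest perplexity) as evidence of what it actually saw in training. This is a minimal illustrative sketch, not the paper's exact procedure; the `fake_perplexity` function is a hypothetical stand-in for a real perplexity computation with a causal language model, used here so the example runs without downloading a model.

```python
from datetime import date

def estimate_effective_cutoff(versions, perplexity_fn):
    """Given {date: text} versions of one resource, return the date whose
    text the model scores as most probable (lowest perplexity).  The
    intuition: a model "prefers" the version closest to what appeared in
    its pre-training data, so the best-scoring version dates the
    effective cutoff for that resource."""
    return min(versions, key=lambda d: perplexity_fn(versions[d]))

# Hypothetical stand-in for scoring text with a real LM (e.g. averaging
# token negative log-likelihoods from a causal model).  Faked here so
# the sketch is self-contained and runnable.
def fake_perplexity(text):
    # Pretend the model's training data contained the March 2023 wording.
    return 5.0 if "March 2023" in text else 20.0

versions = {
    date(2022, 6, 1): "Article text as of June 2022 ...",
    date(2023, 3, 1): "Article text as of March 2023 ...",
    date(2024, 1, 1): "Article text as of January 2024 ...",
}

print(estimate_effective_cutoff(versions, fake_perplexity))  # → 2023-03-01
```

Aggregating these per-resource estimates across many resources is what lets the effective cutoff be compared against the reported one, without any access to the pre-training corpus itself.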
Syllabus
Dated Data: Tracing Knowledge Cutoffs in Large Language Models (COLM 2024 Outstanding Paper Award)
Taught by
Center for Language & Speech Processing (CLSP), JHU