Evaluating World Models in Generative AI - A Critique of AI Intelligence Assessment
Discover AI via YouTube
Overview
Learn about groundbreaking research from Harvard, Cornell, and MIT that challenges conventional AI evaluation methods in this 30-minute video presentation. Explore how traditional metrics like next-token prediction can inadequately assess whether AI models truly understand the structures underlying language. Dive into the application of the Myhill-Nerode theorem to build new evaluation metrics that measure how well a model compresses equivalent sequences and distinguishes inequivalent ones. Follow along through real-world examples, including NYC navigation, Othello gameplay, and logic puzzles, that demonstrate how AI models can appear competent while lacking a coherent internal world model. Examine the timeline of AI investment, emergence theories, world-modeling concepts, and the Chomsky hierarchy before delving into finite automata and the new evaluation metrics. Consider critical findings about AI limitations, explore counterarguments, and reflect on whether AI systems are trustworthy and sufficient for real-world applications.
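
To make the core idea concrete, here is a minimal, illustrative Python sketch of the two Myhill-Nerode-inspired checks described above. It is an assumption-laden toy, not the researchers' released code: the ground-truth language (strings with an even number of 'a's), the bounded-suffix proxy for state equivalence, and the deliberately myopic mock "model" are all inventions for this demo.

from itertools import combinations, product

ALPHABET = "ab"
MAX_LEN = 3  # bound prefix and suffix length so the check stays finite

def all_strings(max_len):
    """Every string over ALPHABET up to the given length."""
    for n in range(max_len + 1):
        for chars in product(ALPHABET, repeat=n):
            yield "".join(chars)

# Ground truth: strings with an even number of 'a's (a 2-state DFA).
def true_accepts(s):
    return s.count("a") % 2 == 0

# Mock generative model: only "sees" the last four tokens, mimicking a
# sequence model that predicts well locally but carries no global state.
def model_accepts(s):
    return s[-4:].count("a") % 2 == 0

def profile(prefix, accepts):
    """The set of bounded suffixes that complete `prefix` into an accepted
    string; equal profiles = Myhill-Nerode equivalent (up to the bound)."""
    return frozenset(w for w in all_strings(MAX_LEN) if accepts(prefix + w))

prefixes = list(all_strings(MAX_LEN))
true_cls = {p: profile(p, true_accepts) for p in prefixes}
model_cls = {p: profile(p, model_accepts) for p in prefixes}

comp_ok = comp_n = dist_ok = dist_n = 0
for u, v in combinations(prefixes, 2):
    if true_cls[u] == true_cls[v]:   # truly equivalent: model should merge
        comp_n += 1
        comp_ok += model_cls[u] == model_cls[v]
    else:                            # truly distinct: model should separate
        dist_n += 1
        dist_ok += model_cls[u] != model_cls[v]

print(f"sequence compression: {comp_ok / comp_n:.2f}")
print(f"sequence distinction: {dist_ok / dist_n:.2f}")

Running the sketch, the myopic model fails the compression check: it assigns different continuation behavior to prefixes the true automaton treats as the same state, even though it looks competent on short, local inputs. That is the kind of incoherence the video argues next-token accuracy alone cannot surface.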
Syllabus
AI tech companies invest US $1T
AI Emergence?
AI World Models
Noam Chomsky Hierarchy
Finite Automaton
NEW IDEA for AI framework
LLM mimicking sequences only?
Myhill-Nerode theorem
2 new evaluation metrics for LLMs
DFA of Manhattan
LLM internal representation
Main Findings
Counterarguments
Trust in AI?
Is AI good enough?
Taught by
Discover AI