Overview
Syllabus
Agenda
Why do we need LLM evaluation?
Common evaluation axes
Why eval is more critical in Gen AI use cases
Why enterprises are often blocked on effective LLM evaluation
Common approaches to LLM evaluation
OSS benchmarks + metrics
LLM-as-a-judge
Annotation strategies
How can we do better than manual annotation strategies?
How data slices enable better LLM evaluation
How does LLM eval work with Snorkel?
Building a quality model
Using fine-grained benchmarks for next steps
Workflow overview
Workflow—starting with the model
Workflow—using an LLM as a judge
Workflow—the quality model
Chatbot demo
Annotating data in Snorkel Flow demo
Building labeling functions in Snorkel Flow demo
LLM evaluation in Snorkel Flow demo
Snorkel Flow Jupyter notebook demo
Data slices in Snorkel Flow demo
Recap
Snorkel eval offer!
Q&A
Taught by
Snorkel AI