Overview
Syllabus
Agenda
Why do we need LLM evaluation?
Common evaluation axes
Why eval is more critical in Gen AI use cases
Why enterprises are often blocked on effective LLM evaluation
Common approaches to LLM evaluation
OSS benchmarks + metrics
LLM-as-a-judge
Annotation strategies
How can we do better than manual annotation strategies?
How data slices enable better LLM evaluation
How does LLM eval work with Snorkel?
Building a quality model
Using fine-grained benchmarks for next steps
Workflow overview
Workflow—starting with the model
Workflow—using an LLM as a judge
Workflow—the quality model
Chatbot demo
Annotating data in Snorkel Flow demo
Building labeling functions in Snorkel Flow demo
LLM evaluation in Snorkel Flow demo
Snorkel Flow Jupyter notebook demo
Data slices in Snorkel Flow demo
Recap
Snorkel eval offer!
Q&A
Taught by
Snorkel AI