Learn everything you need to know about evals by building and monitoring a customer support chatbot from scratch.
You'll learn the three components of every eval (dataset, task, and scorer) and the three approaches to scoring: deterministic, LLM-as-judge, and human review.
AI evals help you answer practical questions about your AI system: Which prompt produces better responses? Did my latest change make quality worse? Without evals, the only way to answer questions like these is to manually test a few prompts yourself and hope everything works. With evals, you can answer them systematically.
At its core, an eval is simple. It can be broken down into three main components: a dataset, a task, and a scorer.
A dataset is a collection of test cases. Each test case has an input, and optionally an expected output, metadata, and tags.
Different use cases call for different dataset shapes. Here are three examples that show how the shape of a dataset changes based on what you're evaluating.
Customer support eval dataset. For an open-ended task like generating a support response, there isn't one correct answer. The dataset is just a list of realistic customer messages, with no expected output column. You'll score these with an LLM-as-judge later instead of comparing against a fixed answer.
| Input |
|---|
| "How do I reset my password?" |
| "My order never arrived." |
| "Can I get a refund?" |
| "Your app keeps crashing on iOS." |
Factual Q&A dataset. For questions with a single correct answer, each row has an input and a strict expected output. You can score these deterministically with exact match or a similarity check.
| Input | Expected output |
|---|---|
| "What is the capital of France?" | "Paris" |
| "Who wrote Hamlet?" | "William Shakespeare" |
| "What year did WWII end?" | "1945" |
Music generation dataset. For creative generation tasks, the input might be a short prompt and the interesting information lives in metadata columns like genre, mood, or tempo. There's no single right output, but metadata lets you slice and filter results by category.
| Input | Genre | Mood |
|---|---|---|
| "Write a 30-second intro track." | Lo-fi hip hop | Chill |
| "Compose a victory fanfare." | Orchestral | Triumphant |
| "Generate background music for a coffee shop." | Jazz | Relaxed |
The dataset captures what you're testing. Its shape should follow the task.
The task defines what your AI system does with each input. It's the function that takes an input from the dataset and produces an output.
In the customer support example, the task is: "Generate a response to the customer's message." In a factual Q&A eval, the task might be: "Answer the question." The task is the thing you're evaluating.
The scorer determines whether the output is good. It takes the input, the output (and optionally the expected output), and returns a score.
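The three components compose into a simple loop: for each test case, run the task on the input, then score the output. A minimal sketch, with `task` and `scorer` as placeholders for whatever you're evaluating and however you grade it:

```python
def run_eval(dataset, task, scorer):
    """Run the task on every test case and score each output."""
    results = []
    for case in dataset:
        output = task(case["input"])
        # Pass the expected output if the dataset has one; open-ended
        # datasets won't, so default to None.
        score = scorer(case["input"], output, case.get("expected"))
        results.append({"input": case["input"], "output": output, "score": score})
    return results
```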
This is where evals get interesting, because there are three distinct approaches to scoring.
Deterministic scorers use hard-coded rules. They're fast, free, and predictable. Use them when the correct answer is unambiguous.
```python
def exact_match(output, expected):
    return 1 if output.strip().lower() == expected.strip().lower() else 0
```
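Exact match is the strictest option. A looser deterministic check is string similarity; a sketch using Python's standard-library `difflib` (the threshold is illustrative):

```python
from difflib import SequenceMatcher

def fuzzy_match(output, expected, threshold=0.8):
    """Score 1 if the strings are similar enough, 0 otherwise."""
    ratio = SequenceMatcher(
        None, output.strip().lower(), expected.strip().lower()
    ).ratio()
    return 1 if ratio >= threshold else 0
```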
Good for: factual Q&A, classification tasks, format validation.
LLM-as-judge uses another language model to evaluate the output. You provide a rubric ("rate this response on helpfulness from 0 to 1") and the model returns a score. This is useful when correctness is subjective or when there's no single right answer.
Good for: tone and style evaluation, open-ended generation, customer support quality.
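The structure of an LLM-as-judge scorer can be sketched as follows. Here `call_model` is a hypothetical stand-in for whatever client your stack uses; only the prompt-building and score-parsing are shown concretely:

```python
# Rubric prompt asking the judge model for a single number.
JUDGE_PROMPT = """Rate the following support response on helpfulness from 0 to 1.
Reply with only a number.

Customer message: {input}
Response: {output}
"""

def parse_score(reply):
    """Pull a 0-1 float out of the judge's reply, clamping out-of-range values."""
    return max(0.0, min(1.0, float(reply.strip())))

def llm_judge(input, output, call_model):
    """Score an output by asking another model to grade it against a rubric."""
    reply = call_model(JUDGE_PROMPT.format(input=input, output=output))
    return parse_score(reply)
```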
Human review puts a person in the loop. Reviewers look at outputs and assign scores manually. This is the most accurate approach but also the slowest and most expensive.
Good for: validating scorers, handling edge cases, high-stakes decisions.
In practice, most teams use a combination. Deterministic scorers handle the clear-cut checks, LLM-as-judge covers subjective quality, and human review validates that your automated scorers are working correctly.
In the next lesson, you'll build your first eval in the Braintrust UI, with no code required. You'll create a dataset, test two different chatbot personas in the playground, and compare results using a custom scorer.