What is an eval?

Every eval has three components: a dataset, a task, and a scorer. And there are three approaches to scoring: deterministic, LLM-as-judge, and human review.

The questions evals help you answer

AI evals help you answer practical questions about your AI system. Questions like:

  1. Which model is actually best for my use case?
  2. How does my system perform across different real-world inputs, like English versus Japanese, or TypeScript versus Python?
  3. How can I maintain high accuracy without driving costs through the roof?
  4. Do the AI-written responses reflect my company's tone, standards, and policies?
  5. How do I know when something breaks?

Without evals, the only way to answer these questions is to manually test a couple of prompts yourself and hope everything works. With evals, you can answer them systematically.

The three components of every eval

At its core, an eval is simple. It can be broken down into three main components: a dataset, a task, and a scorer.

Dataset

A dataset is a collection of test cases. Each test case has an input, and optionally an expected output, metadata, and tags.

Different use cases call for different dataset shapes. Here are three examples that show how the shape of a dataset changes based on what you're evaluating.

Customer support eval dataset. For an open-ended task like generating a support response, there isn't one correct answer. The dataset is just a list of realistic customer messages, with no expected output column. You'll score these with an LLM-as-judge later instead of comparing against a fixed answer.

| Input |
| --- |
| "How do I reset my password?" |
| "My order never arrived." |
| "Can I get a refund?" |
| "Your app keeps crashing on iOS." |

Factual Q&A dataset. For questions with a single correct answer, each row has an input and a strict expected output. You can score these deterministically with exact match or a similarity check.

| Input | Expected output |
| --- | --- |
| "What is the capital of France?" | "Paris" |
| "Who wrote Hamlet?" | "William Shakespeare" |
| "What year did WWII end?" | "1945" |

Music generation dataset. For creative generation tasks, the input might be a short prompt and the interesting information lives in metadata columns like genre, mood, or tempo. There's no single right output, but metadata lets you slice and filter results by category.

| Input | Genre | Mood |
| --- | --- | --- |
| "Write a 30-second intro track." | Lo-fi hip hop | Chill |
| "Compose a victory fanfare." | Orchestral | Triumphant |
| "Generate background music for a coffee shop." | Jazz | Relaxed |

The dataset captures what you're testing. Its shape should follow the task.
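To make the shapes concrete, here's a minimal sketch of the three datasets as plain Python lists of dicts. The field names `input`, `expected`, and `metadata` are illustrative, not a required schema:

```python
# Customer support: inputs only, no expected output.
support_dataset = [
    {"input": "How do I reset my password?"},
    {"input": "My order never arrived."},
]

# Factual Q&A: each input pairs with a strict expected output.
qa_dataset = [
    {"input": "What is the capital of France?", "expected": "Paris"},
    {"input": "Who wrote Hamlet?", "expected": "William Shakespeare"},
]

# Music generation: no expected output, but metadata for slicing results.
music_dataset = [
    {
        "input": "Write a 30-second intro track.",
        "metadata": {"genre": "Lo-fi hip hop", "mood": "Chill"},
    },
]
```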

Task

The task defines what your AI system does with each input. It's the function that takes an input from the dataset and produces an output.

In the customer support example, the task is: "Generate a response to the customer's message." In a factual Q&A eval, the task might be: "Answer the question." The task is the thing you're evaluating.
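In code, a task is just a function from input to output. Here's a sketch of the customer support task; `model_call` is a stand-in for whatever function invokes your AI system, injected as a parameter so the task stays easy to test:

```python
def support_task(input_text, model_call):
    # model_call is a placeholder for your AI system's entry point
    # (an API call, a chain, an agent, etc.).
    prompt = f"Generate a response to the customer's message:\n{input_text}"
    return model_call(prompt)
```

During an eval run, the harness calls this function once per dataset row and hands the output to the scorer.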

Scorer

The scorer determines whether the output is good. It takes the input, the output (and optionally the expected output), and returns a score.

This is where evals get interesting, because there are three distinct approaches to scoring.

Three approaches to scoring

Deterministic scoring

Deterministic scorers use hard-coded rules. They're fast, free, and predictable. Use them when the correct answer is unambiguous.

```python
def exact_match(output, expected):
    # Normalize whitespace and case before comparing.
    return 1 if output.strip().lower() == expected.strip().lower() else 0
```

Good for: factual Q&A, classification tasks, format validation.

LLM-as-judge

LLM-as-judge uses another language model to evaluate the output. You provide a rubric ("rate this response on helpfulness from 0 to 1") and the model returns a score. This is useful when correctness is subjective or when there's no single right answer.
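A minimal sketch of a judge scorer. The `call_model` parameter is an assumption standing in for your LLM client (no specific provider's API is implied); the rubric asks for a bare number, which the scorer parses and clamps to the 0-1 range:

```python
def helpfulness_judge(input_text, output_text, call_model):
    # call_model: hypothetical function that sends a prompt to a
    # judge model and returns its text reply.
    rubric = (
        "Rate the response's helpfulness from 0 to 1. "
        "Reply with only the number.\n\n"
        f"Customer message: {input_text}\n"
        f"Response: {output_text}"
    )
    raw = call_model(rubric)
    try:
        # Clamp to [0, 1] in case the judge goes out of range.
        return max(0.0, min(1.0, float(raw.strip())))
    except ValueError:
        return 0.0  # unparseable judgment counts as a zero score
```

In practice, judge prompts are usually more detailed, spelling out the rubric's criteria and giving examples of high- and low-scoring responses.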

Good for: tone and style evaluation, open-ended generation, customer support quality.

Human review

Human review puts a person in the loop. Reviewers look at outputs and assign scores manually. This is the most accurate approach but also the slowest and most expensive.

Good for: validating scorers, handling edge cases, high-stakes decisions.

In practice, most teams use a combination. Deterministic scorers handle the clear-cut checks, LLM-as-judge covers subjective quality, and human review validates that your automated scorers are working correctly.
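One simple way to combine the automated approaches is a weighted average over the per-scorer results. A sketch, with illustrative scorer names and weights:

```python
def combined_score(scores, weights):
    # scores and weights are parallel dicts keyed by scorer name.
    total = sum(weights.values())
    return sum(scores[name] * weights[name] for name in scores) / total

# Example: a deterministic format check weighted lower than judged quality.
result = combined_score(
    {"format_check": 1.0, "helpfulness": 0.7},
    {"format_check": 1.0, "helpfulness": 2.0},
)
```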

What's next

In the next lesson, you'll build your first eval in the Braintrust UI, with no code required. You'll create a dataset, test two different chatbot personas in the playground, and compare results using a custom scorer.

Further reading

Trace everything