- Offline evals are structured experiments used to compare and improve your app systematically.
- Online evals run scorers on live requests to monitor performance in real time.

Why are evals important?
In AI development, it's hard for teams to understand how an update will impact performance. This breaks the typical software development loop, making iteration feel like guesswork instead of engineering. Evaluations solve this, helping you distill the non-deterministic outputs of AI applications into an effective feedback loop so you can ship more reliable, higher-quality products. Specifically, great evals help you:

- Understand whether an update is an improvement or a regression
- Quickly drill down into good / bad examples
- Diff specific examples vs. prior runs
- Avoid playing whack-a-mole
Breaking down evals
Evals consist of 3 parts:

- Data: a set of examples to test your application on
- Task: the AI function you want to test (any function that takes in an `input` and returns an `output`)
- Scores: a set of scoring functions that take an `input`, `output`, and optional `expected` value and compute a score
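To make the three pieces concrete, here is a minimal, framework-free sketch that wires them together by hand. The names (`task`, `exact_match`) and the toy arithmetic "model" are illustrative assumptions, not part of any SDK:

```python
# Data: examples pairing an input with an expected value
data = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "3 + 5", "expected": "8"},
]

# Task: the function under test (a toy stand-in for a model call)
def task(input):
    return str(eval(input))  # toy arithmetic "model"; a real task would call your AI app

# Score: compares output against expected, returning a score between 0 and 1
def exact_match(input, output, expected):
    return 1.0 if output == expected else 0.0

# Run every example through the task and score the results
scores = [exact_match(ex["input"], task(ex["input"]), ex["expected"]) for ex in data]
print(sum(scores) / len(scores))  # average score across the dataset
```

An eval framework automates exactly this loop, plus logging, diffing, and the UI described below.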
Run an eval by calling the `Eval()` function with these 3 pieces:
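A sketch of the call using the Braintrust Python SDK is below. The project name, task, and scorer are hypothetical, and the `Eval()` call is guarded behind an API-key check since it sends results to Braintrust:

```python
import os

# Task: the AI function under test (a hypothetical toy example)
def task(input):
    return "Hi " + input

# Score: a custom scorer taking input, output, and expected
def exact_match(input, output, expected):
    return 1.0 if output == expected else 0.0

# Data: examples with inputs and expected outputs
data = [{"input": "Foo", "expected": "Hi Foo"}]

# Only run the eval when Braintrust credentials are configured
if os.environ.get("BRAINTRUST_API_KEY"):
    from braintrust import Eval

    Eval(
        "Say Hi Bot",        # project name (any string)
        data=lambda: data,   # Data
        task=task,           # Task
        scores=[exact_match],  # Scores
    )
```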
View experiments
Running your `Eval()` function will automatically create an experiment in Braintrust, display a summary in your terminal, and populate the UI:

- Preview each test case and score in a table
- Filter by high or low scores
- Select any individual example and see detailed tracing
- See high level scores
- Sort by improvements or regressions