Comparing experiments helps you understand which changes improve performance and which cause regressions. Braintrust automatically highlights differences and makes it easy to drill into specific test cases.

Automatic comparison

When you run multiple experiments on the same dataset, Braintrust automatically compares them. The UI shows:
  • Score differences for each test case
  • Improvements highlighted in green
  • Regressions highlighted in red
  • Percentage changes in summary metrics
Enable diff mode with the toggle to focus on changes.
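
For example, running two evals against the same data in the same project is enough to trigger the comparison. A minimal sketch with placeholder data and task logic, assuming autoevals' Levenshtein scorer and the SDK's experimentName option (check your SDK version for exact option names):

import { Eval } from "braintrust";
import { Levenshtein } from "autoevals";

// Shared test cases: identical inputs let Braintrust line the runs up
const data = [
  { input: "What is the capital of France?", expected: "Paris" },
  { input: "What is 2 + 2?", expected: "4" },
];

// Baseline run
Eval("My Project", {
  experimentName: "baseline-experiment",
  data: () => data,
  task: async (input: string) => `Answer: ${input}`, // placeholder task
  scores: [Levenshtein],
});

// Candidate run on the same inputs; Braintrust compares it against the earlier run in the UI
Eval("My Project", {
  experimentName: "new-experiment",
  data: () => data,
  task: async (input: string) => `Reply: ${input}`, // placeholder for the change under test
  scores: [Levenshtein],
});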

Compare in the UI

Select baseline

Choose which experiment to use as the baseline for comparison. The current experiment’s scores will be compared against the baseline.

Sort by changes

Sort the table by improvements or regressions to quickly identify which test cases changed the most.

View side-by-side diffs

Select any row to see detailed diffs for each field. The diff view shows:
  • Input (should be identical)
  • Output differences highlighted
  • Score changes with delta values
  • Metadata and timing comparisons

Compare multiple experiments

Add multiple experiments as comparisons to see how changes evolve over time:
  1. Select an experiment
  2. Click Add comparison
  3. Choose additional experiments
  4. View all in grid or summary layout

Grid layout

See outputs side-by-side in a table. Select which fields to display from the Fields dropdown. This is perfect for spot-checking specific test cases across experiments.

Summary layout

View aggregate metrics across all experiments in a reporting-friendly format. Both layouts respect filters and groupings.

Match test cases

Braintrust matches test cases across experiments using the input field by default. Test cases with identical inputs are considered the same example.
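
The simplest way to guarantee identical inputs is to point every experiment at the same stored dataset rather than ad-hoc literals. A rough sketch, assuming a dataset named "QA cases" already exists in the project and that your SDK version exposes the initDataset helper with this signature:

import { Eval, initDataset } from "braintrust";

Eval("My Project", {
  experimentName: "new-experiment",
  // Every experiment that reads this dataset sees exactly the same inputs,
  // so its rows match up with other experiments run on the dataset
  data: initDataset("My Project", { dataset: "QA cases" }),
  task: async (input) => `Answer: ${JSON.stringify(input)}`, // placeholder task
  scores: [
    ({ output, expected }) => ({
      name: "exact_match",
      score: output === expected ? 1 : 0,
    }),
  ],
});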

Custom comparison keys

If your input includes additional data, configure a custom comparison key:
  1. Go to project Configuration
  2. Define a SQL expression for matching
  3. Save the comparison key
For example, use input.user_query instead of the entire input object if other fields vary.
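
To see why this matters, consider a hypothetical input shape (these field names are illustrative, not part of Braintrust):

// Only user_query is stable across experiments; the other fields change every run,
// so matching on the whole input object would never line these rows up
const input = {
  user_query: "How do I reset my password?",
  session_id: "a1b2c3d4",
  timestamp: "2024-05-01T12:00:00Z",
};

A comparison key of input.user_query matches rows on the stable field and ignores the rest.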

Compare trials

When you run multiple trials (repeated evaluations of the same input), group by input to see aggregate results (see the sketch after these steps):
  1. Select Input from the Group dropdown
  2. View average scores across trials
  3. Identify variance and reliability issues
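
To produce trials programmatically, ask the eval framework to run each input several times. A sketch assuming the SDK's trialCount option (verify the name against your SDK version):

import { Eval } from "braintrust";
import { Levenshtein } from "autoevals";

Eval("My Project", {
  experimentName: "trials-experiment",
  trialCount: 5, // evaluate each input 5 times
  data: () => [{ input: "What is the capital of France?", expected: "Paris" }],
  task: async (input: string) => `Answer: ${input}`, // placeholder; in practice a nondeterministic model call
  scores: [Levenshtein],
});

Grouping by Input in the UI then shows the average score and the spread across the five trials.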

Compare programmatically

Access comparison data through the SDK:
import { init } from "braintrust";

const experiment = init("My Project", {
  experiment: "new-experiment",
  baseExperiment: "baseline-experiment",
});

// Run your evaluation
const summary = await experiment.summarize();

console.log("Improvements:", summary.improvements);
console.log("Regressions:", summary.regressions);
console.log("Score delta:", summary.scoreDelta);

Analyze differences

When comparing experiments, look for:

Consistent improvements

Test cases that improve across all scorers indicate a genuine enhancement. These are your wins.

Consistent regressions

Test cases that regress across scorers indicate a problem. Investigate and fix before deploying.

Mixed results

Cases that improve on some scorers but regress on others require careful analysis. You may need to adjust scorer weights or accept trade-offs.

Edge case changes

Look for patterns in which types of inputs improved or regressed. Group by metadata to identify systematic issues.
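
Grouping by metadata only works if the rows carry metadata, so tag cases when you define the data. A small sketch with made-up category labels:

import { Eval } from "braintrust";
import { Levenshtein } from "autoevals";

Eval("My Project", {
  experimentName: "new-experiment",
  data: () => [
    // The category tag is arbitrary; pick dimensions you want to slice by
    { input: "Say hello", expected: "Hello!", metadata: { category: "smalltalk" } },
    { input: "What is 2 + 2?", expected: "4", metadata: { category: "math" } },
  ],
  task: async (input: string) => `Answer: ${input}`, // placeholder task
  scores: [Levenshtein],
});

Group by the category field in the comparison view to see whether a regression is concentrated in one slice.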

Use comparison in CI/CD

Set score thresholds in CI to automatically catch regressions:
- name: Run Evals
  uses: braintrustdata/eval-action@v1
  with:
    api_key: ${{ secrets.BRAINTRUST_API_KEY }}
    runtime: node
    fail_on_regression: true
    min_score: 0.7
The action will fail the build if scores drop below thresholds or show significant regressions.
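
If you need a gate outside the GitHub action, the same check can be scripted with the SDK summary from the earlier example. A sketch that reuses those fields (their exact shape may vary by SDK version, and the zero threshold is illustrative):

import { init } from "braintrust";

const experiment = init("My Project", {
  experiment: "new-experiment",
  baseExperiment: "baseline-experiment",
});

// ... run your evaluation ...

const summary = await experiment.summarize();

// Fail the CI job on an overall score drop versus the baseline
if (summary.scoreDelta < 0) {
  console.error("Score regressed vs. baseline. Regressions:", summary.regressions);
  process.exit(1);
}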
