Comparing experiments helps you understand which changes improve performance and which cause regressions. Braintrust automatically highlights differences and makes it easy to drill into specific test cases.

Automatic comparison

When you run multiple experiments on the same dataset, Braintrust automatically compares them. The UI shows:
  • Score differences for each test case
  • Improvements highlighted in green
  • Regressions highlighted in red
  • Percentage changes in summary metrics
Enable diff mode with the toggle to focus on changes.
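
For example, running two evals against the same data in the same project is enough to trigger the comparison. A minimal sketch with placeholder data and task logic, assuming autoevals' Levenshtein scorer and the SDK's experimentName option (check your SDK version for exact option names):

import { Eval } from "braintrust";
import { Levenshtein } from "autoevals";

// Shared test cases: identical inputs let Braintrust line the runs up
const data = [
  { input: "What is the capital of France?", expected: "Paris" },
  { input: "What is 2 + 2?", expected: "4" },
];

// Baseline run
Eval("My Project", {
  experimentName: "baseline-experiment",
  data: () => data,
  task: async (input: string) => `Answer: ${input}`, // placeholder task
  scores: [Levenshtein],
});

// Candidate run on the same inputs; Braintrust compares it against the earlier run in the UI
Eval("My Project", {
  experimentName: "new-experiment",
  data: () => data,
  task: async (input: string) => `Reply: ${input}`, // placeholder for the change under test
  scores: [Levenshtein],
});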

Compare in the UI

Select baseline

Choose which experiment to use as the baseline for comparison. The current experiment’s scores will be compared against the baseline.

Sort by changes

Sort the table by improvements or regressions to quickly identify which test cases changed the most.

View side-by-side diffs

Select any row to see detailed diffs for each field. The diff view shows:
  • Input (should be identical)
  • Output differences highlighted
  • Score changes with delta values
  • Metadata and timing comparisons

Compare multiple experiments

Add multiple experiments as comparisons to see how changes evolve over time:
  1. Select an experiment
  2. Click Add comparison
  3. Choose additional experiments
  4. View all in grid or summary layout

Grid layout

See outputs side-by-side in a table. Select which fields to display from the Fields dropdown. This is perfect for spot-checking specific test cases across experiments.

Summary layout

View aggregate metrics across all experiments in a reporting-friendly format. Both layouts respect filters and groupings.

Match test cases

Braintrust matches test cases across experiments using the input field by default. Test cases with identical inputs are considered the same example.
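
The simplest way to guarantee identical inputs is to point every experiment at the same stored dataset rather than ad-hoc literals. A rough sketch, assuming a dataset named "QA cases" already exists in the project and that your SDK version exposes the initDataset helper with this signature:

import { Eval, initDataset } from "braintrust";

Eval("My Project", {
  experimentName: "new-experiment",
  // Every experiment that reads this dataset sees exactly the same inputs,
  // so its rows match up with other experiments run on the dataset
  data: initDataset("My Project", { dataset: "QA cases" }),
  task: async (input) => `Answer: ${JSON.stringify(input)}`, // placeholder task
  scores: [
    ({ output, expected }) => ({
      name: "exact_match",
      score: output === expected ? 1 : 0,
    }),
  ],
});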

Custom comparison keys

If your input includes additional data, configure a custom comparison key:
  1. Go to project Configuration
  2. Define a SQL expression for matching
  3. Save the comparison key
For example, use input.user_query instead of the entire input object if other fields vary.
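
To see why this matters, consider a hypothetical input shape (these field names are illustrative, not part of Braintrust):

// Only user_query is stable across experiments; the other fields change every run,
// so matching on the whole input object would never line these rows up
const input = {
  user_query: "How do I reset my password?",
  session_id: "a1b2c3d4",
  timestamp: "2024-05-01T12:00:00Z",
};

A comparison key of input.user_query matches rows on the stable field and ignores the rest.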

Compare trials

When you run multiple trials (repeated evaluations of the same input), group by input to see aggregate results (see the sketch after these steps):
  1. Select Input from the Group dropdown
  2. View average scores across trials
  3. Identify variance and reliability issues
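
To produce trials programmatically, ask the eval framework to run each input several times. A sketch assuming the SDK's trialCount option (verify the name against your SDK version):

import { Eval } from "braintrust";
import { Levenshtein } from "autoevals";

Eval("My Project", {
  experimentName: "trials-experiment",
  trialCount: 5, // evaluate each input 5 times
  data: () => [{ input: "What is the capital of France?", expected: "Paris" }],
  task: async (input: string) => `Answer: ${input}`, // placeholder; in practice a nondeterministic model call
  scores: [Levenshtein],
});

Grouping by Input in the UI then shows the average score and the spread across the five trials.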

Compare programmatically

Access comparison data through the SDK:
import { init } from "braintrust";

const experiment = init("My Project", {
  experiment: "new-experiment",
  baseExperiment: "baseline-experiment",
});

// Run your evaluation
const summary = await experiment.summarize();

console.log("Improvements:", summary.improvements);
console.log("Regressions:", summary.regressions);
console.log("Score delta:", summary.scoreDelta);

Analyze differences

When comparing experiments, look for:

Consistent improvements

Test cases that improve across all scorers indicate a genuine enhancement. These are your wins.

Consistent regressions

Test cases that regress across scorers indicate a problem. Investigate and fix before deploying.

Mixed results

Cases that improve on some scorers but regress on others require careful analysis. You may need to adjust scorer weights or accept trade-offs.

Edge case changes

Look for patterns in which types of inputs improved or regressed. Group by metadata to identify systematic issues.
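
Grouping by metadata only works if the rows carry metadata, so tag cases when you define the data. A small sketch with made-up category labels:

import { Eval } from "braintrust";
import { Levenshtein } from "autoevals";

Eval("My Project", {
  experimentName: "new-experiment",
  data: () => [
    // The category tag is arbitrary; pick dimensions you want to slice by
    { input: "Say hello", expected: "Hello!", metadata: { category: "smalltalk" } },
    { input: "What is 2 + 2?", expected: "4", metadata: { category: "math" } },
  ],
  task: async (input: string) => `Answer: ${input}`, // placeholder task
  scores: [Levenshtein],
});

Group by the category field in the comparison view to see whether a regression is concentrated in one slice.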

Use comparison in CI/CD

Set score thresholds in CI to automatically catch regressions:
- name: Run Evals
  uses: braintrustdata/eval-action@v1
  with:
    api_key: ${{ secrets.BRAINTRUST_API_KEY }}
    runtime: node
    fail_on_regression: true
    min_score: 0.7
The action will fail the build if scores drop below thresholds or show significant regressions.
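
If you need a gate outside the GitHub action, the same check can be scripted with the SDK summary from the earlier example. A sketch that reuses those fields (their exact shape may vary by SDK version, and the zero threshold is illustrative):

import { init } from "braintrust";

const experiment = init("My Project", {
  experiment: "new-experiment",
  baseExperiment: "baseline-experiment",
});

// ... run your evaluation ...

const summary = await experiment.summarize();

// Fail the CI job on an overall score drop versus the baseline
if (summary.scoreDelta < 0) {
  console.error("Score regressed vs. baseline. Regressions:", summary.regressions);
  process.exit(1);
}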
