Automatic comparison
When you run multiple experiments on the same dataset, Braintrust automatically compares them. The UI shows:
- Score differences for each test case
- Improvements highlighted in green
- Regressions highlighted in red
- Percentage changes in summary metrics
Compare in the UI
Select baseline
Choose which experiment to use as the baseline for comparison. The current experiment’s scores will be compared against the baseline.
Sort by changes
Sort the table by improvements or regressions to quickly identify which test cases changed the most.
View side-by-side diffs
Select any row to see detailed diffs for each field. The diff view shows:
- Input (should be identical)
- Output differences highlighted
- Score changes with delta values
- Metadata and timing comparisons
Compare multiple experiments
Add multiple experiments as comparisons to see how changes evolve over time:
- Select an experiment
- Click Add comparison
- Choose additional experiments
- View all in grid or summary layout
Grid layout
See outputs side-by-side in a table. Select which fields to display from the Fields dropdown. This is perfect for spot-checking specific test cases across experiments.
Summary layout
View aggregate metrics across all experiments in a reporting-friendly format. Both layouts respect filters and groupings.
Match test cases
Braintrust matches test cases across experiments using the `input` field by default. Test cases with identical inputs are considered the same example.
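Conceptually, each test case is keyed by a canonical serialization of its input, so records whose inputs are structurally identical line up across experiments. A minimal sketch of that idea (the record shape here is illustrative, not the SDK's exact schema):

```python
import json

def comparison_key(record: dict) -> str:
    # Canonicalize the input so structurally identical inputs produce the
    # same key; records sharing a key are treated as the same test case.
    return json.dumps(record["input"], sort_keys=True)
```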
Custom comparison keys
If your `input` includes additional data, configure a custom comparison key:
- Go to project Configuration
- Define a SQL expression for matching
- Save the comparison key

For example, match on `input.user_query` instead of the entire input object if other fields vary.
Compare trials
When you run multiple trials (repeated evaluations of the same input), group by input to see aggregate results:
- Select Input from the Group dropdown
- View average scores across trials
- Identify variance and reliability issues
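The same aggregation can be computed outside the UI. A minimal sketch, assuming each record is a dict shaped like `{"input": ..., "scores": {...}}` (the real SDK record shape may differ):

```python
import json
from collections import defaultdict
from statistics import mean, stdev

def summarize_trials(records: list[dict], score_name: str) -> dict:
    """Group trial records by input and report mean/stdev for one scorer."""
    groups: dict[str, list[float]] = defaultdict(list)
    for record in records:
        key = json.dumps(record["input"], sort_keys=True)
        groups[key].append(record["scores"][score_name])
    return {
        key: {
            "trials": len(scores),
            "mean": mean(scores),
            # Spread across trials; high values flag nondeterministic cases.
            "stdev": stdev(scores) if len(scores) > 1 else 0.0,
        }
        for key, scores in groups.items()
    }
```

A high standard deviation for a given input suggests the task is nondeterministic or the scorer is unreliable for that case.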
Compare programmatically
Access comparison data through the SDK:
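How you fetch records depends on your setup (the Braintrust SDK or REST API); the `fetch_records` helper below is a hypothetical placeholder, and the record shape is illustrative. The sketch mirrors what the UI comparison does: match records on input, then compute per-case score deltas against a baseline.

```python
import json

def fetch_records(experiment_name: str) -> list[dict]:
    """Hypothetical placeholder: pull an experiment's records via the
    Braintrust SDK or REST API (see the SDK reference for the exact call)."""
    raise NotImplementedError

def score_deltas(baseline: str, candidate: str, score_name: str) -> dict[str, float]:
    """Per-test-case score change (candidate minus baseline), matched on input."""
    def by_input(records: list[dict]) -> dict[str, dict]:
        return {json.dumps(r["input"], sort_keys=True): r for r in records}

    base = by_input(fetch_records(baseline))
    cand = by_input(fetch_records(candidate))
    return {
        key: cand[key]["scores"][score_name] - base[key]["scores"][score_name]
        for key in base.keys() & cand.keys()  # only test cases present in both
    }

# Example: negative deltas are regressions, positive deltas are improvements.
# deltas = score_deltas("baseline-exp", "candidate-exp", "Factuality")
```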
Analyze differences
When comparing experiments, look for:
Consistent improvements
Test cases that improve across all scorers indicate a genuine enhancement. These are your wins.
Consistent regressions
Test cases that regress across scorers indicate a problem. Investigate and fix before deploying.
Mixed results
Cases that improve on some scorers but regress on others require careful analysis. You may need to adjust scorer weights or accept trade-offs.
Edge case changes
Look for patterns in which types of inputs improved or regressed. Group by metadata to identify systematic issues.
Use comparison in CI/CD
Set score thresholds in CI to automatically catch regressions:
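For example, a gate script can fail the pipeline when a score falls below an absolute floor or regresses too far against the baseline. This is a minimal sketch; the summary shape and threshold values are assumptions to adapt to however your pipeline exposes experiment results.

```python
import sys

# Assumed summary shape (adapt to your pipeline's output):
# {"scores": {"Factuality": {"score": 0.87, "diff": -0.04}, ...}}
MIN_SCORE = 0.85        # absolute floor for each scorer
MAX_REGRESSION = 0.02   # largest tolerated drop vs. the baseline

def check(summary: dict) -> None:
    for name, result in summary["scores"].items():
        if result["score"] < MIN_SCORE:
            sys.exit(f"{name} score {result['score']:.2f} is below {MIN_SCORE}")
        if result.get("diff", 0.0) < -MAX_REGRESSION:
            sys.exit(f"{name} regressed by {-result['diff']:.2f} vs. the baseline")
    print("All scores within thresholds")
```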
Next steps
- Interpret results in detail
- Use playgrounds for rapid iteration
- Write scorers to measure what matters
- Run evaluations in CI/CD