
Experiment overview
The experiment view shows:
- Summary metrics: Average scores across all test cases
- Table: Individual test cases with inputs, outputs, and scores
- Filter bar: Natural language or SQL queries to find specific cases
- Column controls: Show/hide columns and reorder by importance
- Diff mode toggle: Compare to baseline experiments
View summaries
The summary pane displays:
- Comparisons to other experiments
- Scorers used in the evaluation
- Datasets tested
- Metadata like model and parameters

Analyze scores
Column header summaries
Score and metric columns show summary statistics. Filter by improvements or regressions to focus on specific areas.

Group by metadata
Group the table by metadata fields or inputs to see patterns. For example, group by dataset to identify which use cases have the most issues. By default, group rows show summary data for one experiment at a time; switch between experiments by selecting the aggregation you want to view.

Sort by regressions
Within grouped tables, sort rows by regressions of a specific score relative to a comparison experiment.

Examine individual cases
Select any row to open the trace view and see complete details:
- Input, output, and expected values
- Metadata and parameters
- All spans in the trace hierarchy
- Scores and their explanations
- Timing and token usage

Create custom columns
Extract specific values from traces using custom columns:
- Select Add custom column from the Columns dropdown or use the + icon
- Name your column
- Choose from inferred fields or write a SQL expression
View experiment layouts
Grid layout
Compare outputs side by side in a table. Select fields to display from the Fields dropdown.

Summary layout
A large-type summary of scores and metrics across all experiments, well suited to reporting and presentations. Both layouts respect view filters.

Understand metrics
Braintrust tracks these metrics automatically:
- Duration: Time to complete the task span
- Offset: Time elapsed since trace start
- Prompt tokens: Tokens in the input
- Completion tokens: Tokens in the output
- Total tokens: Combined token count
- LLM duration: Time spent in LLM calls
- Estimated cost: Approximate cost based on pricing
LLM metrics are computed from LLM calls in the task subspan, excluding LLM-as-a-judge scorer calls.
To compute LLM metrics, wrap your LLM calls with Braintrust provider wrappers.
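For example, with the Python SDK you can wrap an OpenAI client so that token counts, LLM duration, and estimated cost are captured on each LLM call's span. This is a minimal sketch; the model name and prompt are placeholders, and it assumes you are running inside an active experiment or logger context.

```python
from braintrust import wrap_openai
from openai import OpenAI

# Wrapping the client records token usage, LLM duration, and estimated cost
# on each LLM call's span, so the metrics above appear in the experiment view.
client = wrap_openai(OpenAI())

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize this ticket in one sentence."}],
)
print(response.choices[0].message.content)
```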
Use aggregate scores
Aggregate scores combine multiple scores into a single metric for reporting and comparison. Create them in your project’s Configuration page under Add aggregate score. Three aggregation types are available (illustrated numerically after the list):
- Weighted average: A weighted average of selected scores
- Minimum: The minimum value among selected scores
- Maximum: The maximum value among selected scores
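As a rough numeric sketch of how these three types behave (the score names and weights below are hypothetical; Braintrust computes aggregate scores from the configuration you define):

```python
# Hypothetical per-experiment score values and weights, for illustration only.
scores = {"factuality": 0.82, "conciseness": 0.64, "tone": 0.91}
weights = {"factuality": 0.5, "conciseness": 0.3, "tone": 0.2}

# Weighted average: each score scaled by its weight, normalized by total weight.
weighted_average = sum(scores[k] * weights[k] for k in scores) / sum(weights.values())
minimum = min(scores.values())  # worst-case score among the selected scores
maximum = max(scores.values())  # best-case score among the selected scores

print(f"weighted average={weighted_average:.2f}, min={minimum:.2f}, max={maximum:.2f}")
# -> weighted average=0.78, min=0.64, max=0.91
```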
Score retrospectively
Apply scorers to existing experiments:
- Multiple cases: Select rows and use Score to apply chosen scorers
- Single case: Open a trace and use Score in the trace view
Analyze across experiments
Compare performance across multiple experiments using visualizations.

Bar chart
On the Experiments page, view scores as a bar chart by selecting Score comparison from the X axis selector. Group by metadata fields to create comparative bar charts.

Scatter plot
Select a metric on the x-axis to construct scatter plots. For example, compare the relationship between accuracy and duration.

Export experiments
You can export an experiment’s results from the UI, the SDK, or the API. From the UI, open the menu next to the experiment name, export as CSV or JSON, and choose whether to download all fields.
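For SDK and API exports, one option is to pull an experiment’s records over the REST API. The sketch below is an assumption-laden example: it uses the public fetch endpoint with a placeholder experiment ID and expects a BRAINTRUST_API_KEY environment variable.

```python
import os
import requests

# Fetch an experiment's logged events (inputs, outputs, scores, metadata) as JSON.
# Placeholder ID; large experiments are paginated, so follow the returned cursor.
experiment_id = "<your-experiment-id>"
resp = requests.get(
    f"https://api.braintrust.dev/v1/experiment/{experiment_id}/fetch",
    headers={"Authorization": f"Bearer {os.environ['BRAINTRUST_API_KEY']}"},
    params={"limit": 100},
)
resp.raise_for_status()
events = resp.json().get("events", [])
print(f"Fetched {len(events)} records")
```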

Next steps
- Compare experiments systematically
- Write scorers to measure what matters
- Use playgrounds for rapid iteration
- Run evaluations in CI/CD