After running evaluations, the experiment view helps you understand performance, identify issues, and find opportunities for improvement through detailed analysis and comparison tools.

Experiment overview

The experiment view shows:
  • Summary metrics: Average scores across all test cases
  • Table: Individual test cases with inputs, outputs, and scores
  • Filter bar: Natural language or SQL queries to find specific cases
  • Column controls: Show/hide columns and reorder by importance
  • Diff mode toggle: Compare to baseline experiments

View summaries

The summary pane displays:
  • Comparisons to other experiments
  • Scorers used in the evaluation
  • Datasets tested
  • Metadata like model and parameters
Copy the experiment ID from the bottom of the summary pane for referencing in code or sharing with teammates.
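Once you have the ID, you can pull the experiment programmatically. The snippet below is a minimal sketch using Python's requests library; the API host and endpoint path are assumptions, so confirm them against the Braintrust API reference.

```python
# Minimal sketch: fetch an experiment's metadata by ID.
# The host and path below are assumptions; check the Braintrust API
# reference for the exact endpoint.
import os
import requests

EXPERIMENT_ID = "your-experiment-id"  # copied from the summary pane
API_KEY = os.environ["BRAINTRUST_API_KEY"]

resp = requests.get(
    f"https://api.braintrust.dev/v1/experiment/{EXPERIMENT_ID}",
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```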

Analyze scores

Column header summaries

Score and metric columns show summary statistics. Filter by improvements or regressions to focus on specific areas.

Group by metadata

Group the table by metadata fields or inputs to see patterns. For example, group by dataset to identify which use cases have the most issues. By default, group rows show one experiment's summary data, and you can switch between experiments by selecting your desired aggregation. To view summary data for all experiments, select Include comparisons in group.

Sort by regressions

Within grouped tables, sort rows by regressions of a specific score relative to a comparison experiment.

Examine individual cases

Select any row to open the trace view and see complete details:
  • Input, output, and expected values
  • Metadata and parameters
  • All spans in the trace hierarchy
  • Scores and their explanations
  • Timing and token usage
Ask yourself: Do good scores correspond to good outputs? If not, update your scorers or test cases.
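For example, a handwritten scorer can replace one that isn't tracking output quality. A rough sketch using the Braintrust Python SDK's Eval entry point (the exact-match logic, project name, and data are illustrative only):

```python
# Rough sketch of a handwritten scorer to swap in after reviewing traces.
# The exact-match logic is illustrative; replace it with whatever definition
# of a good output your review surfaced.
from braintrust import Eval

def exact_match(input, output, expected):
    """Return 1.0 when the output matches the expected value, else 0.0."""
    return 1.0 if output == expected else 0.0

Eval(
    "My Project",                                      # hypothetical project name
    data=lambda: [{"input": "2+2", "expected": "4"}],  # illustrative test case
    task=lambda input: "4",                            # stand-in for your real task
    scores=[exact_match],
)
```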

Create custom columns

Extract specific values from traces using custom columns:
  1. Select Add custom column from the Columns dropdown or use the + icon
  2. Name your column
  3. Choose from inferred fields or write a SQL expression
Once created, filter and sort using your custom columns.

View experiment layouts

Grid layout

Compare outputs side-by-side in a table. Select fields to display from the Fields dropdown.

Summary layout

See a large-type summary of scores and metrics across all experiments, well suited to reporting and presentations. Both layouts respect view filters.

Understand metrics

Braintrust tracks these metrics automatically:
  • Duration: Time to complete the task span
  • Offset: Time elapsed since trace start
  • Prompt tokens: Tokens in the input
  • Completion tokens: Tokens in the output
  • Total tokens: Combined token count
  • LLM duration: Time spent in LLM calls
  • Estimated cost: Approximate cost based on pricing
Metrics are computed on the task subspan, excluding LLM-as-a-judge scorer calls.
To compute LLM metrics, wrap your LLM calls with Braintrust provider wrappers.
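For example, with the Python SDK you can wrap an OpenAI client so token counts and LLM durations are captured on each call (a minimal sketch; adapt the model and prompt to your task):

```python
# Minimal sketch: wrap the OpenAI client so Braintrust records token counts,
# LLM duration, and estimated cost for calls made inside your task.
from braintrust import wrap_openai
from openai import OpenAI

client = wrap_openai(OpenAI())

def task(input):
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat model available to your account
        messages=[{"role": "user", "content": input}],
    )
    return completion.choices[0].message.content
```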

Use aggregate scores

Aggregate scores combine multiple scores into a single metric for reporting and comparison. Create them in your project's Configuration page under Add aggregate score. Braintrust supports three types:
  • Weighted average: A weighted average of selected scores
  • Minimum: The minimum value among selected scores
  • Maximum: The maximum value among selected scores
Aggregate scores are useful when you track many scores but need a single metric to represent overall experiment quality.
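To make the options concrete, the arithmetic behind each type looks like this (an illustration of the formulas with made-up scores and weights, not Braintrust's implementation):

```python
# Illustration: the three aggregate types over made-up scores and weights.
scores = {"factuality": 0.92, "conciseness": 0.70, "tone": 0.85}
weights = {"factuality": 0.5, "conciseness": 0.3, "tone": 0.2}  # sum to 1

weighted_average = sum(scores[name] * weights[name] for name in scores)  # 0.84
minimum = min(scores.values())                                           # 0.70
maximum = max(scores.values())                                           # 0.92
```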

Score retrospectively

Apply scorers to existing experiments:
  • Multiple cases: Select rows and use Score to apply chosen scorers
  • Single case: Open a trace and use Score in the trace view
Scores appear as additional spans within the trace.

Analyze across experiments

Compare performance across multiple experiments using visualizations.

Bar chart

On the Experiments page, view scores as a bar chart by selecting Score comparison from the X axis selector. Group by metadata fields to create comparative bar charts.

Scatter plot

Select a metric on the x-axis to construct scatter plots. For example, compare the relationship between accuracy and duration.

Export experiments

To export an experiment's results, open the menu next to the experiment name. You can export as CSV or JSON, and choose whether to download all fields.
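Exported files are easy to post-process. A small sketch, assuming a CSV export saved as results.csv (the filename is hypothetical):

```python
# Sketch: load an exported experiment CSV for ad-hoc analysis.
# "results.csv" is a hypothetical filename; use whatever you downloaded.
import pandas as pd

df = pd.read_csv("results.csv")
print(df.columns.tolist())  # inspect which fields were exported
print(df.head())
```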

Next steps