
Experiment overview
The experiment view shows:
- Summary metrics: Average scores across all test cases
- Table: Individual test cases with inputs, outputs, and scores
- Filter bar: Natural language or SQL queries to find specific cases
- Column controls: Show/hide columns and reorder by importance
- Diff mode toggle: Compare to baseline experiments
View summaries
The summary pane displays:
- Comparisons to other experiments
- Scorers used in the evaluation
- Datasets tested
- Metadata like model and parameters

Analyze scores
Column header summaries
Score and metric columns show summary statistics. Filter by improvements or regressions to focus on specific areas.

Group by metadata
Group the table by metadata fields or inputs to see patterns. For example, group by dataset to identify which use cases have the most issues. By default, group rows show summary data for one experiment at a time; switch between experiments by selecting the aggregation you want to view.

Sort by regressions
Within grouped tables, sort rows by regressions of a specific score relative to a comparison experiment.

Examine individual cases
Select any row to open the trace view and see complete details:
- Input, output, and expected values
- Metadata and parameters
- All spans in the trace hierarchy
- Scores and their explanations
- Timing and token usage

Create custom columns
Extract specific values from traces using custom columns:
- Select Add custom column from the Columns dropdown or use the + icon
- Name your column
- Choose from inferred fields or write a SQL expression
View experiment layouts
Grid layout
Compare outputs side by side in a table. Select fields to display from the Fields dropdown.

Summary layout
A large-type summary of scores and metrics across all experiments, well suited to reporting and presentations. Both layouts respect view filters.

Understand metrics
Braintrust tracks these metrics automatically:
- Duration: Time to complete the task span
- Offset: Time elapsed since trace start
- Prompt tokens: Tokens in the input
- Completion tokens: Tokens in the output
- Total tokens: Combined token count
- LLM duration: Time spent in LLM calls
- Estimated cost: Approximate cost based on pricing
LLM metrics are computed from LLM calls in the task subspan, excluding LLM-as-a-judge scorer calls.
To compute LLM metrics, wrap your LLM calls with Braintrust provider wrappers.
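For example, with the Python SDK you can wrap an OpenAI client so that token counts, LLM duration, and estimated cost are captured on each LLM call's span. This is a minimal sketch; the model name and prompt are placeholders, and it assumes you are running inside an active experiment or logger context.

```python
from braintrust import wrap_openai
from openai import OpenAI

# Wrapping the client records token usage, LLM duration, and estimated cost
# on each LLM call's span, so the metrics above appear in the experiment view.
client = wrap_openai(OpenAI())

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize this ticket in one sentence."}],
)
print(response.choices[0].message.content)
```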
Use aggregate scores
Aggregate scores combine multiple scores into a single metric for reporting and comparison. Create them in your project’s Configuration page under Add aggregate score. Three aggregation types are available (illustrated numerically after the list):
- Weighted average: A weighted average of selected scores
- Minimum: The minimum value among selected scores
- Maximum: The maximum value among selected scores
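As a rough numeric sketch of how these three types behave (the score names and weights below are hypothetical; Braintrust computes aggregate scores from the configuration you define):

```python
# Hypothetical per-experiment score values and weights, for illustration only.
scores = {"factuality": 0.82, "conciseness": 0.64, "tone": 0.91}
weights = {"factuality": 0.5, "conciseness": 0.3, "tone": 0.2}

# Weighted average: each score scaled by its weight, normalized by total weight.
weighted_average = sum(scores[k] * weights[k] for k in scores) / sum(weights.values())
minimum = min(scores.values())  # worst-case score among the selected scores
maximum = max(scores.values())  # best-case score among the selected scores

print(f"weighted average={weighted_average:.2f}, min={minimum:.2f}, max={maximum:.2f}")
# -> weighted average=0.78, min=0.64, max=0.91
```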
Score retrospectively
Apply scorers to existing experiments:
- Multiple cases: Select rows and use Score to apply chosen scorers
- Single case: Open a trace and use Score in the trace view
Analyze across experiments
Compare performance across multiple experiments using visualizations.

Bar chart
On the Experiments page, view scores as a bar chart by selecting Score comparison from the X axis selector. Group by metadata fields to create comparative bar charts.

Scatter plot
Select a metric on the x-axis to construct scatter plots. For example, compare the relationship between accuracy and duration.

Export experiments
You can export an experiment’s results from the UI, the SDK, or the API. From the UI, open the menu next to the experiment name, export as CSV or JSON, and choose whether to download all fields.
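For SDK and API exports, one option is to pull an experiment’s records over the REST API. The sketch below is an assumption-laden example: it uses the public fetch endpoint with a placeholder experiment ID and expects a BRAINTRUST_API_KEY environment variable.

```python
import os
import requests

# Fetch an experiment's logged events (inputs, outputs, scores, metadata) as JSON.
# Placeholder ID; large experiments are paginated, so follow the returned cursor.
experiment_id = "<your-experiment-id>"
resp = requests.get(
    f"https://api.braintrust.dev/v1/experiment/{experiment_id}/fetch",
    headers={"Authorization": f"Bearer {os.environ['BRAINTRUST_API_KEY']}"},
    params={"limit": 100},
)
resp.raise_for_status()
events = resp.json().get("events", [])
print(f"Fetched {len(events)} records")
```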

Next steps
- Compare experiments systematically
- Write scorers to measure what matters
- Use playgrounds for rapid iteration
- Run evaluations in CI/CD