Interpret eval results

View results in the UI

Open the Braintrust UI

Your eval will return a link to the corresponding results in Braintrust's UI. Open the link. You will land on a detailed view of the eval run that you selected.

You will see:

  • Diff mode toggle - Allows you to compare eval runs to each other. If you click the toggle, you will see the results of your current eval compared to the results of the baseline.
  • Filter bar - Allows you to focus in on a subset of test cases. You can filter by typing natural language or BTQL.
  • Summary panel - (On the right.) Shows trends across your scores, so you can home in on problematic areas.
  • Table - Shows the data for every test case in your eval run.

One eval run

Find a pattern to investigate

To find test cases to focus on, we recommend using the summary panel. Results in the panel are ordered by regressions count. This allows you to see the scorers with the biggest issues. You can also change the grouping to see summaries across any dimension in your metadata. For example, if you use separate datasets for distinct types of usecases, you can group by dataset to see which usecases are having the biggest issues. To get to an interesting subset of your test cases, click any of the filters in the summary panel.

Now that you've narrowed your test cases, you can view a test case in detail by clicking a row.

Examine the trace view

This will open the trace view. Here you can see all of the data for the trace for this test case, including input, output, metadata, and metrics for each span inside the trace.

Look at the scores and the output and decide whether the scores seem "right". Do good scores correspond to a good output? If not, you'll want to improve your evals by updating scorers or test cases.

Trace view

Export experiments

UI

To export an experiment's results, click on the three vertical dots in the upper right-hand corner of the UI. You can export as CSV or JSON.

Export experiments

API

To fetch the events in an experiment via the API, see Fetch experiment (POST form) or Fetch experiment (GET form).

SDK

If you need to access the data from a previous experiment, you can pass the open flag into init() and then just iterate through the experiment object:

import { init } from "braintrust";
 
async function openExperiment() {
  const experiment = init(
    "Say Hi Bot", // Replace with your project name
    {
      experiment: "my-experiment", // Replace with your experiment name
      open: true,
    },
  );
  for await (const testCase of experiment) {
    console.log(testCase);
  }
}

You can use the the asDataset()/as_dataset() function to automatically convert the experiment into the same fields you'd use in a dataset (input, expected, and metadata).

import { init } from "braintrust";
 
async function openExperiment() {
  const experiment = init(
    "Say Hi Bot", // Replace with your project name
    {
      experiment: "my-experiment", // Replace with your experiment name
      open: true,
    },
  );
 
  for await (const testCase of experiment.asDataset()) {
    console.log(testCase);
  }
}

For a more advanced overview of how to use reuse experiments as datasets, see Hill climbing.

On this page