Track dataset performance

Monitor how dataset rows perform across experiments.

View experiment runs

See all experiments that used a dataset:

Go to Datasets.
Open your dataset.
In the right panel, select Runs.
Review performance metrics across experiments.

Runs display as charts that show score trends over time. The time axis flows from oldest (left) to newest (right), making it easy to track performance evolution.

Filter experiment runs

To narrow down the list of experiment runs, filter by time range, by tag, or with SQL.

Filter by time range: Click and drag across any region of the chart to select a time range. The table below the chart updates to show only experiments in that range. To clear the filter, click Clear. This helps you focus on specific periods, such as recent experiments or historical baselines.

Filter by tag: Click any tag chip on an experiment row to filter the list to runs with that tag. You can also add a Tags column via Display > Columns to see the tags for each run at a glance. To filter by tag in a query, use BTQL’s INCLUDES operator:

filter: tags INCLUDES 'my-tag'

Filter with SQL: Select Filter and use the Basic tab for common filters, or switch to the SQL tab to write more precise queries based on criteria such as score thresholds, time ranges, or experiment names.

Common filtering examples:

-- Filter by time range
WHERE created > '2024-01-01'

-- Filter by score threshold
WHERE scores.Accuracy > 0.8

-- Filter by experiment name pattern
WHERE name LIKE '%baseline%'

-- Combine multiple conditions
WHERE created > now() - interval 7 day
  AND scores.Factuality > 0.7
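
As a rough illustration of what the combined condition selects, the Python sketch below applies the same two checks to in-memory records. The record shape (name, created, scores), the helper name, and the sample values are assumptions made for this example. It does not use the Braintrust SDK and does not describe how the filter is evaluated on the server.

```python
from datetime import datetime, timedelta, timezone


def recent_high_factuality(runs, now, days=7, threshold=0.7):
    """Keep runs created within `days` of `now` whose Factuality exceeds `threshold`."""
    cutoff = now - timedelta(days=days)
    return [
        run
        for run in runs
        if run["created"] > cutoff
        # A missing score never passes the threshold check.
        and run["scores"].get("Factuality", float("-inf")) > threshold
    ]


# Hypothetical sample data, fixed in time so the result is reproducible.
now = datetime(2024, 6, 15, tzinfo=timezone.utc)
runs = [
    {"name": "baseline-v1", "created": now - timedelta(days=2), "scores": {"Factuality": 0.9}},
    {"name": "baseline-v2", "created": now - timedelta(days=10), "scores": {"Factuality": 0.95}},
    {"name": "prompt-tweak", "created": now - timedelta(days=1), "scores": {"Factuality": 0.5}},
    {"name": "no-score", "created": now - timedelta(days=1), "scores": {}},
]

selected = [run["name"] for run in recent_high_factuality(runs, now)]
print(selected)
```

Only the first record satisfies both conditions: the second is too old, the third scores too low, and the fourth has no Factuality score.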

Filter states are persisted in the URL, allowing you to bookmark or share specific filtered views of experiment runs.

Analyze per-row performance

See how individual rows perform:

Select a row in the dataset table.
In the right panel, select Runs.
Review the row’s metrics across experiments.

This view shows only experiments that set the origin field in eval traces.

Look for patterns:

Consistently low scores suggest ambiguous expectations.
Failures across experiments indicate edge cases.
High variance suggests instability.
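
If you copy a row's scores out of the Runs view, the three patterns above can be checked with a small script. This is a minimal sketch: the thresholds and the describe_row helper are hypothetical and are not part of any Braintrust API, so adjust them to your own scorers.

```python
from statistics import pstdev

# Hypothetical thresholds; tune them for your own scorers.
LOW_SCORE = 0.5      # every score below this counts as "consistently low"
FAIL_BELOW = 0.2     # a score below this counts as a failure
HIGH_SPREAD = 0.25   # population standard deviation above this counts as unstable


def describe_row(scores):
    """Return pattern labels for one row's scores across experiments."""
    labels = []
    if not scores:
        return labels
    if max(scores) < LOW_SCORE:
        labels.append("consistently low")
    if sum(s < FAIL_BELOW for s in scores) >= 2:
        labels.append("repeated failures")
    if len(scores) > 1 and pstdev(scores) > HIGH_SPREAD:
        labels.append("high variance")
    return labels


print(describe_row([0.3, 0.35, 0.4]))
print(describe_row([0.1, 0.9, 0.15, 0.95]))
print(describe_row([0.8, 0.85, 0.9]))
```

A row can match more than one pattern. The labels are only a prompt to open the row and inspect the underlying traces.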

Next steps

Run more evaluations to expand the dataset’s coverage.
Edit records that surface as problematic.
