Datasets are versioned collections of test cases that you use to run evaluations and track improvements over time. Build datasets from production logs, user feedback, or manual curation, or generate them with Loop. After reviewing traces with scores and labels, compile them into structured datasets for evaluation.

Why use datasets

Datasets in Braintrust have key advantages:
  • Integrated: Use directly in evaluations, explore in playgrounds, and populate from production.
  • Versioned: Every change is tracked, so experiments can pin to specific versions.
  • Scalable: Stored in a modern data warehouse without storage or performance limits.
  • Secure: Self-hosted deployments keep data in your warehouse.

Create datasets

Upload CSV/JSON

The fastest way to create a dataset is to upload a CSV or JSON file:
  1. Go to Datasets.
  2. If there are existing datasets, click + Dataset. Otherwise, click Upload CSV/JSON.
  3. Drag and drop your file in the Upload dataset dialog.
  4. Columns automatically map to the input field. Drag and drop them into different categories as needed:
    • Input: Fields used as inputs for your task.
    • Expected: Ground truth or ideal outputs for scoring.
    • Metadata: Additional context for filtering and grouping.
    • Tags: Labels for organizing and filtering. When you categorize columns as tags, they’re automatically added to your project’s tag configuration.
    • Do not import: Exclude columns from the dataset.
    The preview table updates in real time as you move columns between categories, showing exactly how your dataset will be structured.
  5. Click Import.
If your data includes an id field, duplicate rows will be deduplicated, with only the last occurrence of each ID kept.
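For example, a minimal CSV might look like this (the columns are hypothetical; during import you could map question to Input, answer to Expected, and category to Metadata):
question,answer,category
What is 2+2?,4,math
What is the capital of France?,Paris,geography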

Generate with Loop

Ask Loop to create a dataset based on your logs or specific criteria. Example queries:
  • “Generate a dataset from the highest-scoring examples in this experiment”
  • “Create a dataset with the most common inputs in the logs”

Add records manually

Once you’ve created a dataset, add or edit records directly in the UI.

From user feedback

User feedback from production provides valuable test cases that reflect real user interactions. Use feedback to create datasets from highly-rated examples or problematic cases. See Capture user feedback for implementation details on logging feedback programmatically. To build datasets from feedback:
  1. Filter logs by feedback scores using the Filter menu:
    • scores.user_rating > 0.8 (SQL) or filter: scores.user_rating > 0.8 (BTQL) for highly-rated examples
    • metadata.thumbs_up = false for negative feedback
    • comment IS NOT NULL and scores.correctness < 0.5 for low-scoring feedback with comments
  2. Select the traces you want to include.
  3. Select Add to dataset.
  4. Choose an existing dataset or create a new one.
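For reference, the feedback scores used in the filters above are typically recorded with the SDK's logFeedback method. A minimal sketch, assuming you captured the span ID when the original request was logged (see Capture user feedback for the full pattern):
import { initLogger } from "braintrust";

const logger = initLogger({ projectName: "My App" });

// Hypothetical helper: `spanId` identifies the logged request
// that the feedback should be attached to.
function recordThumbsUp(spanId: string): void {
  logger.logFeedback({
    id: spanId,
    scores: { user_rating: 1 },
    comment: "Great answer",
  });
}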
You can also ask Loop to create datasets based on feedback patterns, such as “Create a dataset from logs with positive feedback” or “Build a dataset from cases where users clicked thumbs down.”

Dataset structure

Each record has three top-level fields:
  • input: Data to recreate the example in your application (required).
  • expected: Ideal output or ground truth (optional but recommended for evaluation).
  • metadata: Key-value pairs for filtering and grouping (optional).
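For example, inserting a record with all three fields using the TypeScript SDK (a minimal sketch; the field contents are hypothetical, and the insert API follows the attachment example later on this page):
import { initDataset } from "braintrust";

async function addRecord(): Promise<void> {
  const dataset = initDataset({ project: "My App", dataset: "My Dataset" });
  dataset.insert({
    input: { question: "What is the capital of France?" }, // required
    expected: "Paris", // optional, but recommended for scoring
    metadata: { category: "geography" }, // optional, for filtering and grouping
  });
  await dataset.flush();
}

addRecord();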

View and edit datasets

From the dataset page, you can:
  • Filter and search records
  • Create custom columns to extract nested values
  • Edit records inline
  • Copy records between datasets
  • Delete individual records or entire datasets

Create custom columns

Extract values from records using custom columns. Use SQL expressions to surface important fields directly in the table. For example, an expression like metadata.user_id (assuming your records include that field) pulls a nested value into its own column.

Label datasets

Configure categorical scores to allow reviewers to rapidly label records. See Configure review scores for details.

Define schemas

Dataset schemas let you define JSON schemas for input, expected, and metadata fields. Schemas enable:
  • Validation: Ensure records conform to your structure.
  • Form-based editing: Edit records with intuitive forms instead of raw JSON.
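For example, a minimal JSON schema for an input field that expects a single question string might look like this (a sketch; the property name is hypothetical):
{
  "type": "object",
  "properties": {
    "question": { "type": "string" }
  },
  "required": ["question"]
}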

Infer from data

Automatically generate schemas from existing data:
  1. Open the schema editor for a field.
  2. Click Infer schema.
  3. A schema is generated from the first 100 records.

Enable enforcement

Toggle Enforce in the schema editor to validate all records. When enabled:
  • New records must conform or show validation errors.
  • Existing non-conforming records display warnings.
  • Form editing validates input automatically.
Enforcement is UI-only and doesn’t affect SDK inserts or updates.

Read and filter datasets

Use the filter menu to narrow dataset views, or write SQL queries for complex filtering. See Filter and search for details.

Track performance

Monitor how dataset rows perform across experiments:

View experiment runs

See all experiments that used a dataset:
  1. Go to your dataset page.
  2. In the right panel, select Runs.
  3. Review performance metrics across experiments.
Runs display as charts that show score trends over time. The time axis flows from oldest (left) to newest (right), making it easy to track performance evolution.

Filter experiment runs

To narrow down the list of experiment runs, filter by time range or with SQL.
Filter by time range: Click and drag across any region of the chart to select a time range. The table below updates to show only experiments in that range. To clear the filter, click clear. This helps you focus on specific periods, like recent experiments or historical baselines.
Filter with SQL: Select Filter and use the Basic tab for common filters, or switch to SQL to write more precise queries based on criteria like score thresholds, time ranges, or experiment names. Common filtering examples:
-- Filter by time range
WHERE created > '2024-01-01'

-- Filter by score threshold
WHERE scores.Accuracy > 0.8

-- Filter by experiment name pattern
WHERE name LIKE '%baseline%'

-- Combine multiple conditions
WHERE created > now() - interval 7 day
  AND scores.Factuality > 0.7
Filter states are persisted in the URL, allowing you to bookmark or share specific filtered views of experiment runs.

Analyze per-row performance

See how individual rows perform:
  1. Select a row in the dataset table.
  2. In the right panel, select Runs.
  3. Review the row’s metrics across experiments.
This view only shows experiments that set the origin field in eval traces.
Look for patterns:
  • Consistently low scores suggest ambiguous expectations.
  • Failures across experiments indicate edge cases.
  • High variance suggests instability.

Multimodal datasets

You can store and process images and other file types in your datasets. There are several ways to use files in Braintrust:
  • Image URLs (most performant) - Keep datasets lightweight with external image references.
  • Base64 (least performant) - Encode images directly in records.
  • Attachments (easiest to manage) - Store files directly in Braintrust.
  • External attachments - Reference files in your own object stores.
For large images, use image URLs to keep datasets lightweight. To keep all data within Braintrust, use attachments. Attachments support any file type including images, audio, and PDFs.
import { Attachment, initDataset } from "braintrust";
import path from "node:path";

async function createPdfDataset(): Promise<void> {
  const dataset = initDataset({
    project: "Project with PDFs",
    dataset: "My PDF Dataset",
  });
  for (const filename of ["example.pdf"]) {
    dataset.insert({
      input: {
        // `data` accepts a local file path; the file is uploaded
        // when the dataset is flushed.
        file: new Attachment({
          filename,
          contentType: "application/pdf",
          data: path.join("files", filename),
        }),
      },
    });
  }
  // Send all pending inserts (and upload their attachments).
  await dataset.flush();
}

createPdfDataset();
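By contrast, referencing an image by URL keeps the record itself lightweight. A minimal sketch (the input shape is hypothetical; your task decides how the URL is fetched or passed to a model):
import { initDataset } from "braintrust";

async function createImageUrlDataset(): Promise<void> {
  const dataset = initDataset({
    project: "Project with images",
    dataset: "My image dataset",
  });
  dataset.insert({
    input: {
      image_url: "https://example.com/cat.png", // external reference; nothing is uploaded
      question: "What animal is in this image?",
    },
    expected: "A cat",
  });
  await dataset.flush();
}

createImageUrlDataset();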

Use in evaluations

Pass datasets directly to Eval():
import { initDataset, Eval } from "braintrust";
import { Levenshtein } from "autoevals";

Eval("Say Hi Bot", {
  // Each record's `input` is passed to `task`; `expected` is used by scorers.
  data: initDataset("My App", { dataset: "My Dataset" }),
  task: async (input) => {
    return "Hi " + input;
  },
  // Levenshtein compares the task output to the record's `expected` value.
  scores: [Levenshtein],
});
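Because datasets are versioned, an experiment can also pin to a specific dataset version so reruns stay reproducible. A sketch, assuming initDataset accepts a version option (check the SDK reference for the exact parameter name):
import { initDataset, Eval } from "braintrust";
import { Levenshtein } from "autoevals";

Eval("Say Hi Bot", {
  // Assumption: `version` pins the dataset to a specific version ID.
  data: initDataset("My App", { dataset: "My Dataset", version: "<version-id>" }),
  task: async (input) => "Hi " + input,
  scores: [Levenshtein],
});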

Next steps