Datasets are versioned collections of test cases that you use to run evaluations and track improvements over time. Build datasets from production logs, user feedback, or manual curation, or generate them with Loop.

Why use datasets

Datasets in Braintrust offer several key advantages:
  • Integrated: Use directly in evaluations, explore in playgrounds, and populate from production.
  • Versioned: Every change is tracked, so experiments can pin to specific versions.
  • Scalable: Stored in a modern data warehouse without storage or performance limits.
  • Secure: Self-hosted deployments keep data in your warehouse.

Create datasets

Upload CSV

The fastest way to create a dataset is to upload a CSV file.
Records with an id key are automatically deduplicated by their id value.
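As a sketch, a CSV like the following creates two records; re-uploading a file that reuses the same id values updates those records instead of duplicating them (the column-to-field mapping shown here is illustrative):
id,input,expected
1,What is 2+2?,4
2,What is the capital of France?,Paris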

Generate with Loop

Ask Loop to create a dataset based on your logs or specific criteria. Example queries:
  • “Generate a dataset from the highest-scoring examples in this experiment”
  • “Create a dataset with the most common inputs in the logs”

Add records manually

Once you’ve created a dataset, add or edit records directly in the UI.

Dataset structure

Each record has three top-level fields:
  • input: Data to recreate the example in your application (required).
  • expected: Ideal output or ground truth (optional but recommended for evaluation).
  • metadata: Key-value pairs for filtering and grouping (optional).
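For example, inserting a record with the SDK (project name and values are illustrative):
import { initDataset } from "braintrust";

const dataset = initDataset("My App", { dataset: "QA Golden Set" });

dataset.insert({
  // input: everything needed to re-run the example
  input: { question: "What is the capital of France?" },
  // expected: the ground truth to score against
  expected: "Paris",
  // metadata: key-value pairs for filtering and grouping
  metadata: { category: "geography", difficulty: "easy" },
});

await dataset.flush();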

View and edit datasets

From the dataset page, you can:
  • Filter and search records
  • Create custom columns to extract nested values
  • Edit records inline
  • Copy records between datasets
  • Delete individual records or entire datasets

Create custom columns

Extract values from records using custom columns. Use SQL expressions to surface important fields directly in the table.
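For example, expressions like these (field names illustrative) surface nested values as their own columns:
input.question
metadata.model
expected.answer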

Label datasets

Configure categorical fields to allow reviewers to rapidly label records. This requires first configuring human review in your project’s Configuration tab.

Define schemas

Dataset schemas let you define JSON schemas for input, expected, and metadata fields. Schemas enable:
  • Validation: Ensure records conform to your structure.
  • Form-based editing: Edit records with intuitive forms instead of raw JSON.
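As a minimal sketch, a schema for an input field that holds a single question string might look like this (shape illustrative):
{
  "type": "object",
  "properties": {
    "question": { "type": "string" }
  },
  "required": ["question"]
}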

Infer from data

Automatically generate schemas from existing data:
  1. Open the schema editor for a field.
  2. Click Infer schema.
  3. The schema is generated from the first 100 records.

Enable enforcement

Toggle Enforce in the schema editor to validate all records. When enabled:
  • New records must conform or show validation errors.
  • Existing non-conforming records display warnings.
  • Form editing validates input automatically.
Enforcement is UI-only and doesn’t affect SDK inserts or updates.

Read and filter datasets

Read datasets with the same method used to create them:
import { initDataset } from "braintrust";

const dataset = initDataset("My App", { dataset: "My Existing Dataset" });

// Loads in batches to avoid memory issues
for await (const row of dataset) {
  console.log(row);
}

Filter with BTQL

Use _internal_btql to filter, sort, and limit records:
import { initDataset } from "braintrust";

const dataset = initDataset("My App", {
  dataset: "My Dataset",
  _internal_btql: {
    filter: { btql: "metadata.category = 'premium'" },
    sort: [{ expr: { btql: "created" }, dir: "desc" }],
    limit: 100,
  },
});

Track performance

Monitor how dataset rows perform across experiments:

View experiment runs

See all experiments that used a dataset:
  1. Go to your dataset page.
  2. In the right panel, select Runs.
  3. Review performance metrics across experiments.

Analyze per-row performance

See how individual rows perform:
  1. Select a row in the dataset table.
  2. In the right panel, select Runs.
  3. Review the row’s metrics across experiments.
This view only shows experiments that set the origin field in eval traces.
Look for patterns:
  • Consistently low scores suggest ambiguous expectations.
  • Failures across experiments indicate edge cases.
  • High variance suggests instability.

Multimodal datasets

You can store and process images and other file types in your datasets. There are several ways to use files in Braintrust:
  • Image URLs (most performant) - Keep datasets lightweight with external image references.
  • Base64 (least performant) - Encode images directly in records.
  • Attachments (easiest to manage) - Store files directly in Braintrust.
  • External attachments - Reference files in your own object stores.
For large images, use image URLs to keep datasets lightweight. To keep all data within Braintrust, use attachments, which support any file type, including images, audio, and PDFs. For example, the following inserts a PDF attachment into a dataset:
import { Attachment, initDataset } from "braintrust";
import path from "node:path";

async function createPdfDataset(): Promise<void> {
  const dataset = initDataset({
    project: "Project with PDFs",
    dataset: "My PDF Dataset",
  });
  for (const filename of ["example.pdf"]) {
    dataset.insert({
      input: {
        file: new Attachment({
          filename,
          contentType: "application/pdf",
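          // Local file path; the SDK reads and uploads the file contents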
          data: path.join("files", filename),
        }),
      },
    });
  }
  await dataset.flush();
}

createPdfDataset().catch(console.error);
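
For comparison, an image-URL record stores only the reference, keeping the dataset lightweight (field names and URL are illustrative):
import { initDataset } from "braintrust";

const dataset = initDataset("My App", { dataset: "Image QA" });

dataset.insert({
  input: {
    // Only the URL is stored; the image lives in external storage
    image_url: "https://example.com/images/receipt-001.png",
    question: "What is the total on this receipt?",
  },
  expected: "$42.10",
});

await dataset.flush();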

Use in evaluations

Pass datasets directly to Eval():
import { initDataset, Eval } from "braintrust";
import { Levenshtein } from "autoevals";

Eval("Say Hi Bot", {
  data: initDataset("My App", { dataset: "My Dataset" }),
  task: async (input) => {
    return "Hi " + input;
  },
  scores: [Levenshtein],
});
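Because the dataset is passed directly, each experiment result is linked back to its source dataset row (via the origin field described above), which populates the per-row Runs view.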

Next steps