Datasets

Datasets allow you to collect data from production, staging, evaluations, and even manually, and then use that data to run evaluations and track improvements over time.

For example, you can use Datasets to:

  • Store evaluation test cases for your eval script instead of managing large JSONL or CSV files
  • Log all production generations to assess quality manually or using model-graded evals
  • Store user-reviewed (👍, 👎) generations to find new test cases

In Braintrust, datasets have a few key properties:

  • Integrated. Datasets are integrated with the rest of the Braintrust platform, so you can use them in evaluations, explore them in the playground, and log to them from your staging/production environments.
  • Versioned. Every insert, update, and delete is versioned, so you can pin evaluations to a specific version of the dataset, rewind to a previous version, and track changes over time.
  • Scalable. Datasets are stored in a modern cloud data warehouse, so you can collect as much data as you want without worrying about storage or performance limits.
  • Secure. If you run Braintrust in your cloud environment, datasets are stored in your warehouse and never touch our infrastructure.

Creating a dataset

Records in a dataset are stored as JSON objects, and each record has three top-level fields:

  • input is a set of inputs that you could use to recreate the example in your application. For example, if you're logging examples from a question answering model, the input might be the question.
  • expected (optional) is the output of your model. For example, if you're logging examples from a question answering model, this might be the answer. When you run evaluations, this value is available as the expected field, but it does not need to be the ground truth.
  • metadata (optional) is a set of key-value pairs that you can use to filter and group your data. For example, if you're logging examples from a question answering model, the metadata might include the knowledge source that the question came from.
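
As a rough illustration of the record shape described above, the following sketch models a record as a TypeScript type and filters a few sample records by a metadata key. The DatasetRecord type, the filterByMetadata helper, and the sample data are hypothetical names made up for this example; they are not part of the Braintrust SDK.

```typescript
// Hypothetical type mirroring the three top-level fields described above.
interface DatasetRecord<Input = unknown, Expected = unknown> {
  input: Input;
  expected?: Expected; // optional: the model output, not necessarily ground truth
  metadata?: Record<string, unknown>; // optional: key-value pairs for filtering/grouping
}

// Sample records for a question answering model (made-up data).
const records: DatasetRecord<string, string>[] = [
  {
    input: "What is the capital of France?",
    expected: "Paris",
    metadata: { source: "geography-faq" },
  },
  {
    input: "Who wrote Hamlet?",
    expected: "William Shakespeare",
    metadata: { source: "literature-faq" },
  },
  {
    // expected is optional, so a record can omit it.
    input: "What is the capital of Japan?",
    metadata: { source: "geography-faq" },
  },
];

// Hypothetical helper: keep only records whose metadata[key] equals value.
function filterByMetadata<R extends DatasetRecord>(
  items: R[],
  key: string,
  value: unknown,
): R[] {
  return items.filter((r) => r.metadata?.[key] === value);
}

const geography = filterByMetadata(records, "source", "geography-faq");
console.log(geography.length); // 2
```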

Datasets are created automatically when you initialize them in the SDK.

Inserting records

You can use the SDK to initialize and insert into a dataset:

import { initDataset } from "braintrust";

// Initializing a dataset creates it if it does not already exist.
const dataset = initDataset("My App", { dataset: "My Dataset" });
for (let i = 0; i < 10; i++) {
  // insert() returns the id of the new record.
  const id = dataset.insert({
    input: i,
    expected: { result: i + 1, error: null },
    metadata: { foo: i % 2 },
  });
  console.log("Inserted record with id", id);
}

// Print a summary of the dataset after the inserts.
console.log(await dataset.summarize());

Updating records

In the above example, each insert() call returns an id. You can pass this id back to insert() to update the record later:

// `id` and `i` are the values from the loop in the previous example.
dataset.insert({
  id,
  input: i,
  expected: { result: i + 1, error: "Timeout" },
  metadata: { foo: i % 2 },
});

Deleting records

You can also delete records by id:

await dataset.delete(id);

Managing datasets in the UI

In addition to managing datasets through the API, you can also manage them in the Braintrust UI.

Viewing a dataset

You can view a dataset in the Braintrust UI by navigating to the project and then clicking on the dataset.

From the UI, you can filter records, create new ones, edit values, and delete records. You can also copy records between datasets and from experiments into datasets. This feature is commonly used to collect interesting or anomalous examples into a golden dataset.

Creating a dataset

The easiest way to create a dataset is to upload a CSV file.

Updating records

Once you've uploaded a dataset, you can update records or add new ones directly in the UI.

Labeling records

In addition to updating datasets through the API, you can edit and label them in the UI. As with experiments and logs, you can configure categorical fields so that human reviewers can label records quickly.

Using a dataset in an evaluation

You can use a dataset in an evaluation by passing it directly to the Eval() function.

import { initDataset, Eval } from "braintrust";
import { Levenshtein } from "autoevals";
 
Eval(
  "Say Hi Bot", // Replace with your project name
  {
    data: initDataset("My App", { dataset: "My Dataset" }),
    task: async (input) => {
      return "Hi " + input; // Replace with your LLM call
    },
    scores: [Levenshtein],
  },
);

You can also manually iterate through a dataset's records, run your task on each one, and log the results to an experiment. Log each record's id (as datasetRecordId) to link the dataset record to the corresponding result.

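import { initDataset, init } from "braintrust";

// Placeholder task: replace with your application logic or LLM call.
function myApp(input: any) {
  return `output of input ${input}`;
}

// Placeholder scorer: returns a random value; replace with a real metric.
function myScore(output: any, rowExpected: any) {
  return Math.random();
}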
 
const dataset = initDataset("My App", { dataset: "My Dataset" });
const experiment = init("My App", {
  experiment: "My Experiment",
  dataset: dataset,
});
for await (const row of dataset) {
  const output = myApp(row.input);
  const closeness = myScore(output, row.expected);
  experiment.log({
    input: row.input,
    output,
    expected: row.expected,
    scores: { closeness },
    datasetRecordId: row.id,
  });
}
 
console.log(await experiment.summarize());
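
The scorer in the example above is a random placeholder. As one possible stand-in, the sketch below computes a normalized edit-distance score between 0 and 1 for string outputs. The function names are hypothetical, and this is only a common way to normalize Levenshtein distance, not necessarily how the Levenshtein scorer in autoevals is implemented.

```typescript
// Classic dynamic-programming Levenshtein distance, keeping one row at a time.
function levenshteinDistance(a: string, b: string): number {
  // prev[j] = distance between the first i-1 chars of a and the first j chars of b.
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1, // deletion
        curr[j - 1] + 1, // insertion
        prev[j - 1] + cost, // substitution (or match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical scorer: 1 means identical strings, 0 means nothing in common.
function closenessScore(output: string, expected: string): number {
  const maxLen = Math.max(output.length, expected.length);
  if (maxLen === 0) return 1; // two empty strings are identical
  return 1 - levenshteinDistance(output, expected) / maxLen;
}

console.log(levenshteinDistance("kitten", "sitting")); // 3
console.log(closenessScore("abc", "abc")); // 1
```

A function like closenessScore could be called in place of myScore when both the output and the expected value are strings.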

You can also use the results of an experiment as baseline data for future experiments by calling asDataset() (TypeScript) or as_dataset() (Python), which converts the experiment into dataset format (input, expected, and metadata).

import { init, Eval } from "braintrust";
import { Levenshtein } from "autoevals";
 
const experiment = init(
  "My App",
  {
    experiment: "my-experiment",
    open: true,
  },
);
 
Eval(
  "My App",
  {
    data: experiment.asDataset(),
    task: async (input) => {
      return input + 1; // Replace with your LLM call
    },
    scores: [Levenshtein],
  },
);

For a more advanced overview of how to use an experiment as a baseline for other experiments, see Hill climbing.

Logging from your application

To log to a dataset from your application, call insert() from the SDK. Braintrust logs are queued and sent asynchronously, so inserts do not block your application's critical path.

Because the SDK authenticates with API keys, you should log from a privileged environment (e.g. a backend server) rather than directly from client applications.

This example shows how to track 👍/👎 ratings from user feedback:

import { initDataset, Dataset } from "braintrust";
 
class MyApplication {
  private dataset: Dataset | undefined = undefined;
 
  async initApp() {
    this.dataset = await initDataset("My App", { dataset: "logs" });
  }
 
  async logUserExample(input: any, expected: any, userId: string, orgId: string, thumbsUp: boolean) {
    if (this.dataset) {
      this.dataset.insert({
        input,
        expected,
        metadata: { userId, orgId, thumbsUp },
      });
    } else {
      console.warn("Must initialize application before logging");
    }
  }
}
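
Once feedback is stored in metadata as in the example above, the records can be filtered to surface candidate test cases. The sketch below works on plain in-memory objects with the same metadata keys (userId, orgId, thumbsUp). The FeedbackRecord type, both helper functions, and the sample data are hypothetical and do not call the Braintrust SDK.

```typescript
// Hypothetical shape matching what logUserExample() inserts above.
interface FeedbackRecord {
  input: unknown;
  expected: unknown;
  metadata: { userId: string; orgId: string; thumbsUp: boolean };
}

// Records that users rated 👎 are natural candidates for new test cases.
function thumbsDownRecords(items: FeedbackRecord[]): FeedbackRecord[] {
  return items.filter((r) => !r.metadata.thumbsUp);
}

// Fraction of records rated 👍; returns 0 for an empty list.
function thumbsUpRate(items: FeedbackRecord[]): number {
  if (items.length === 0) return 0;
  const up = items.filter((r) => r.metadata.thumbsUp).length;
  return up / items.length;
}

// Made-up sample data.
const feedbackSample: FeedbackRecord[] = [
  { input: "q1", expected: "a1", metadata: { userId: "u1", orgId: "o1", thumbsUp: true } },
  { input: "q2", expected: "a2", metadata: { userId: "u2", orgId: "o1", thumbsUp: false } },
  { input: "q3", expected: "a3", metadata: { userId: "u3", orgId: "o2", thumbsUp: true } },
];

console.log(thumbsDownRecords(feedbackSample).map((r) => r.input)); // [ 'q2' ]
```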

Troubleshooting

Downloading large datasets

If you are trying to load a very large dataset, you may run into timeout errors while using the SDK. If so, you can paginate through the dataset to download it in smaller chunks.
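
The general pattern for chunked downloads is cursor-based pagination. The sketch below shows that pattern with an injected fetchPage function, so it makes no claim about the actual Braintrust endpoint or its parameters. The Page shape, the FetchPage signature, and the paginate helper are all assumptions made for illustration; check the API reference for the real request and response formats.

```typescript
// Hypothetical page shape: a chunk of records plus a cursor for the next chunk.
interface Page<T> {
  records: T[];
  cursor: string | null; // null means there are no more pages
}

// Hypothetical fetcher signature; wire this up to your real API call.
type FetchPage<T> = (cursor: string | null, limit: number) => Promise<Page<T>>;

// Yield records one at a time while fetching them in chunks of `limit`.
async function* paginate<T>(
  fetchPage: FetchPage<T>,
  limit = 100,
): AsyncGenerator<T> {
  let cursor: string | null = null;
  do {
    const page: Page<T> = await fetchPage(cursor, limit);
    yield* page.records;
    cursor = page.cursor;
  } while (cursor !== null);
}
```

Because paginate is an async generator, callers can consume it with for await and stop early without downloading the remaining pages.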