An experiment is an immutable snapshot of an evaluation run — permanently stored, comparable over time, and shareable across your team. Unlike playground runs, which overwrite previous results for fast iteration, experiments preserve exact results so you can measure improvements, catch regressions, and build confidence in your changes.

Run locally

Run evaluation code locally to create an experiment in Braintrust and return summary metrics, including a direct link to your experiment. See Interpret results for how to read it.
Install the SDK and dependencies:
# pnpm
pnpm add braintrust openai autoevals
# npm
npm install braintrust openai autoevals
Create the eval code:
import { Eval, initDataset } from "braintrust";
import { Factuality } from "autoevals";

Eval("My Project", {
  experimentName: "My experiment",
  data: initDataset("My Project", { dataset: "My dataset" }),
  task: async (input) => {
    // Your LLM call here
    return await callModel(input);
  },
  scores: [Factuality],
  metadata: {
    model: "gpt-5-mini",
  },
});
Run your evaluation with the braintrust eval CLI:
npx braintrust eval my_eval.eval.ts
Use --watch to re-run automatically when files change:
npx braintrust eval --watch my_eval.eval.ts
  • Automatic .env loading — reads .env.development.local, .env.local, .env.development, and .env
  • Multi-file support — pass multiple files or directories: braintrust eval [file or directory] ...
  • TypeScript transpilation — no build step required; the CLI handles it
You can pass a parameters option to make configuration values (like model choice, temperature, or prompts) editable in the playground without changing code. Define parameters inline or use loadParameters() to reference saved configurations. See Write parameters and Test complex agents for details.
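Conceptually, parameters are named, typed values with defaults defined in code that the playground can override at run time. A plain-TypeScript sketch of that resolution idea (illustrative only; the actual `parameters` option shape is documented in Write parameters):

```typescript
// Illustrative only: defaults defined in code, overrides supplied by the
// playground UI. The real `parameters` option is typed and validated by
// the SDK; this just shows the resolution concept.
type Params = { model: string; temperature: number };

const defaults: Params = { model: "gpt-5-mini", temperature: 0 };

// Values edited in the playground win; everything else keeps its default.
function resolveParams(overrides: Partial<Params>): Params {
  return { ...defaults, ...overrides };
}
```

This keeps the task code reading from one resolved object, so the same eval runs unchanged whether values come from code or from the playground.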

Run in UI

Create from scratch

Create and run experiments directly in the Braintrust UI without writing code:
  1. Go to Experiments.
  2. Click + Experiment or use the empty state form.
  3. Select one or more prompts, workflows, or scorers to evaluate.
  4. Choose or create a dataset:
    • Select existing dataset: Pick from datasets in your organization
    • Upload CSV/JSON: Import test cases from a file
    • Empty dataset: Create a blank dataset to populate manually later
  5. Add scorers to measure output quality.
  6. Click Create to execute the experiment.
UI experiments run without a time limit on cloud and on self-hosted deployments running data plane v2.0 or later.

Promote from a playground

Playground runs are mutable — re-running overwrites previous results. When you’ve iterated to a configuration worth keeping, promote it to an experiment to capture an immutable snapshot:
  1. Run your playground.
  2. Select + Experiment.
  3. Name your experiment.
  4. Access it from the Experiments page.
Each playground task maps to its own experiment. Experiments created this way are comparable to any other experiment in your project.

Run in CI/CD

Integrate evaluations into your CI/CD pipeline to catch regressions before they reach production.

GitHub Actions

Use the braintrustdata/eval-action to run evaluations on every pull request:
name: Run evaluations

on:
  pull_request:
    branches: [main]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Set up Node.js
        uses: actions/setup-node@v3
        with:
          node-version: '18'

      - name: Install dependencies
        run: npm install

      - name: Run Evals
        uses: braintrustdata/eval-action@v1
        with:
          api_key: ${{ secrets.BRAINTRUST_API_KEY }}
          runtime: node
The action automatically posts a comment with results to the pull request.

Other CI systems

For other CI systems, run evaluations as a standard shell command:
npx braintrust eval evals/
Ensure your CI environment has the BRAINTRUST_API_KEY environment variable set.
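A small guard at the top of your eval setup can fail fast with an actionable message when the key is missing, instead of an opaque auth error mid-run. A minimal sketch (the helper name is hypothetical; adapt it to your pipeline):

```typescript
// Hypothetical guard: fail fast with a clear message if the API key is
// missing. `env` is injected as a parameter so the check is testable.
function requireApiKey(env: Record<string, string | undefined>): string {
  const key = env.BRAINTRUST_API_KEY;
  if (!key) {
    throw new Error(
      "BRAINTRUST_API_KEY is not set. Add it to your CI secrets and expose it to this job.",
    );
  }
  return key;
}

// In a real pipeline you would call requireApiKey(process.env) before
// invoking `npx braintrust eval`.
```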

Configure experiments

Customize experiment behavior with options:
Eval("My Project", {
  data: myDataset,
  task: myTask,
  scores: [Factuality],

  // Experiment name
  experimentName: "gpt-5-mini-experiment",

  // Metadata for filtering/analysis
  metadata: {
    model: "gpt-5-mini",
    prompt_version: "v2",
  },

  // Maximum concurrency
  maxConcurrency: 10,

  // Trial count for averaging
  trialCount: 3,
});

Run without uploading results

Sometimes you want to run your evaluation locally without creating an experiment in Braintrust — while iterating on a new scorer, wiring up a new eval pipeline, or running in an environment without a Braintrust API key. Your tasks and scorers still run and print a summary to your terminal; results just aren’t uploaded.
Via the CLI:
npx braintrust eval --no-send-logs my_eval.eval.ts
Or in code:
Eval("My Project", {
  data: ...,
  task: ...,
  scores: [...],
  noSendLogs: true,
});

Run trials

Run each input multiple times to measure variance and get more robust scores. Braintrust intelligently aggregates results by bucketing test cases with the same input value:
Eval("My Project", {
  data: myDataset,
  task: myTask,
  scores: [Factuality],
  trialCount: 10, // Run each input 10 times
});
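Conceptually, trial results that share an input are bucketed together and their scores averaged. A minimal sketch of that aggregation idea (this is not the SDK's actual implementation, just an illustration of the bucketing):

```typescript
// Sketch of how trial scores for the same input can be bucketed and
// averaged. Illustrative only; Braintrust's real aggregation is internal.
type TrialResult = { input: string; score: number };

function averageByInput(results: TrialResult[]): Map<string, number> {
  const buckets = new Map<string, number[]>();
  for (const r of results) {
    const scores = buckets.get(r.input) ?? [];
    scores.push(r.score);
    buckets.set(r.input, scores);
  }
  const averages = new Map<string, number>();
  for (const [input, scores] of buckets) {
    averages.set(input, scores.reduce((a, b) => a + b, 0) / scores.length);
  }
  return averages;
}
```

Averaging over trials smooths out nondeterministic model behavior, so score movements between experiments are more likely to reflect real changes than noise.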

Enable hill climbing

Hill climbing lets you improve iteratively without expected outputs by using a previous experiment’s output as the expected for the current run. To enable it, use BaseExperiment() in the data field. Autoevals scorers like Battle and Summary are designed specifically for this workflow.
import { Battle } from "autoevals";
import { Eval, BaseExperiment } from "braintrust";

Eval<string, string, string>(
  "Say Hi Bot", // Replace with your project name
  {
    data: BaseExperiment(),
    task: (input) => {
      return "Hi " + input; // Replace with your task function
    },
    scores: [Battle.partial({ instructions: "Which response said 'Hi'?" })],
  },
);
Braintrust automatically picks the best base experiment, using git metadata if available and timestamps otherwise, then populates the expected field by merging the expected and output fields from the base experiment. If you set expected through the UI while reviewing results, it is used as the expected field for the next experiment. To use a specific experiment as the base, pass the name field to BaseExperiment():
import { Battle } from "autoevals";
import { Eval, BaseExperiment } from "braintrust";

Eval<string, string, string>(
  "Say Hi Bot", // Replace with your project name
  {
    data: BaseExperiment({ name: "main-123" }),
    task: (input) => {
      return "Hi " + input; // Replace with your task function
    },
    scores: [Battle.partial({ instructions: "Which response said 'Hi'?" })],
  },
);
When hill climbing, use two types of scoring functions:
  • Non-comparative methods like ClosedQA that judge output quality based purely on input and output without requiring an expected value. Track these across experiments to compare any two experiments, even if they aren’t sequentially related.
  • Comparative methods like Battle or Summary that accept an expected output but don’t treat it as ground truth. If you score > 50% on a comparative method, you’re doing better than the base on average. Learn more about how Battle and Summary work.
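The 50% threshold can be checked mechanically: average the comparative scores across cases and compare against 0.5. A small sketch of that interpretation (the helper and result shape are hypothetical, not a Braintrust API):

```typescript
// Hypothetical helper: decide whether a hill-climbing run beat its base
// experiment, given per-case comparative scores (e.g. from Battle).
// An average above 0.5 means the new output won more often than it lost
// against the base experiment's output.
function beatsBase(comparativeScores: number[]): boolean {
  if (comparativeScores.length === 0) return false;
  const mean =
    comparativeScores.reduce((a, b) => a + b, 0) / comparativeScores.length;
  return mean > 0.5;
}
```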

Create custom reporters

When you run an experiment, Braintrust logs results to your terminal, and braintrust eval returns a non-zero exit code if any eval throws an exception. Customize this behavior for CI/CD pipelines to precisely define what constitutes a failure or to report results to different systems. Define custom reporters using Reporter(). A reporter has two functions:
import { Reporter } from "braintrust";

Reporter(
  "My reporter", // Replace with your reporter name
  {
    reportEval(evaluator, result, opts) {
      // Summarizes the results of a single evaluator. Return whatever you
      // want (the full results, a piece of text, or both!)
    },

    reportRun(results) {
      // Takes all the results and summarizes them. Return true for success
      // or false for failure; false makes the process exit non-zero.
      return true;
    },
  },
);
Any Reporter included among your evaluated files will be automatically picked up by the braintrust eval command.
  • If no reporters are defined, the default reporter logs results to the console.
  • If you define one reporter, it’s used for all Eval blocks.
  • If you define multiple Reporters, specify the reporter name as an optional third argument to the Eval function.
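For example, a reportRun implementation might fail the build only when a score drops below a threshold, rather than on any exception. A sketch of that decision logic (the summary shape here is simplified and hypothetical; the real objects passed to reportRun are richer):

```typescript
// Hypothetical, simplified per-eval summary; illustrative only.
type EvalSummary = { name: string; avgScore: number };

// Return true (success) only if every eval's average score clears the bar.
// Used from reportRun, returning false would make the process exit
// non-zero and fail the CI job.
function allAboveThreshold(results: EvalSummary[], threshold = 0.8): boolean {
  return results.every((r) => r.avgScore >= threshold);
}
```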

Include attachments

Braintrust allows you to log binary data like images, audio, and PDFs as attachments. Use attachments in evaluations by initializing an Attachment object in your data:
import { Eval, Attachment } from "braintrust";
import { NumericDiff } from "autoevals";
import path from "path";

function loadPdfs() {
  return ["example.pdf"].map((pdf) => ({
    input: {
      file: new Attachment({
        filename: pdf,
        contentType: "application/pdf",
        data: path.join("files", pdf),
      }),
    },
    // This is a toy example where we check that the file size is what we expect.
    expected: 469513,
  }));
}

async function getFileSize(input: { file: Attachment }) {
  return (await input.file.data()).size;
}

Eval("Project with PDFs", {
  data: loadPdfs,
  task: getFileSize,
  scores: [NumericDiff],
});
You can also store attachments in a dataset for reuse across multiple experiments. After creating the dataset, reference it by name in an eval. The attachment data is automatically downloaded from Braintrust when accessed:
import { NumericDiff } from "autoevals";
import { initDataset, Eval, ReadonlyAttachment } from "braintrust";

async function getFileSize(input: {
  file: ReadonlyAttachment;
}): Promise<number> {
  return (await input.file.data()).size;
}

Eval("Project with PDFs", {
  data: initDataset({
    project: "Project with PDFs",
    dataset: "My PDF Dataset",
  }),
  task: getFileSize,
  scores: [NumericDiff],
});
To forward an attachment to an external service like OpenAI, obtain a signed URL instead of downloading the data directly:
import { initDataset, wrapOpenAI, ReadonlyAttachment } from "braintrust";
import { OpenAI } from "openai";

const client = wrapOpenAI(
  new OpenAI({
    apiKey: process.env.OPENAI_API_KEY,
  }),
);

async function main() {
  const dataset = initDataset({
    project: "Project with images",
    dataset: "My Image Dataset",
  });
  for await (const row of dataset) {
    const attachment: ReadonlyAttachment = row.input.file;
    const attachmentUrl = (await attachment.metadata()).downloadUrl;
    const response = await client.chat.completions.create({
      model: "gpt-5-mini",
      messages: [
        {
          role: "system",
          content: "You are a helpful assistant",
        },
        {
          role: "user",
          content: [
            { type: "text", text: "Please summarize the attached image" },
            { type: "image_url", image_url: { url: attachmentUrl } },
          ],
        },
      ],
    });
    const summary = response.choices[0].message.content || "Unknown";
    console.log(
      `Summary for file ${attachment.reference.filename}: ${summary}`,
    );
  }
}

main();

Trace your evals

Add detailed tracing to your evaluation task functions to measure performance and debug issues. Each span in the trace represents an operation like an LLM call, database lookup, or API request.
Use wrapOpenAI/wrap_openai to automatically trace OpenAI API calls. See Trace LLM calls for details.
Each call to experiment.log() creates its own trace. Do not mix experiment.log() with tracing functions like traced(); doing so creates incorrectly parented traces.
Wrap task code with traced() to log incrementally to spans. This example progressively logs input, output, and metrics:
import { Eval, traced } from "braintrust";

async function callModel(input: string) {
  return traced(
    async (span) => {
      const messages = { messages: [{ role: "system", content: input }] };
      span.log({ input: messages });

      // Replace this with a model call
      const result = {
        content: "China",
        latency: 1,
        prompt_tokens: 10,
        completion_tokens: 2,
      };

      span.log({
        output: result.content,
        metrics: {
          latency: result.latency,
          prompt_tokens: result.prompt_tokens,
          completion_tokens: result.completion_tokens,
        },
      });
      return result.content;
    },
    {
      name: "My AI model",
    },
  );
}

const exactMatch = (args: {
  input: string;
  output: string;
  expected?: string;
}) => {
  return {
    name: "Exact match",
    score: args.output === args.expected ? 1 : 0,
  };
};

Eval("My Evaluation", {
  data: () => [
    { input: "Which country has the highest population?", expected: "China" },
  ],
  task: async (input, { span }) => {
    return await callModel(input);
  },
  scores: [exactMatch],
});
This creates a span tree you can visualize in the UI by clicking on each test case in the experiment.

Troubleshooting

If your evaluations are slower than expected when using maxConcurrency, you may be on an older SDK version that flushes logs after every single task completion. Upgrade to TypeScript SDK v3.3.0+ for up to an 8x performance improvement; the SDK now uses byte-based backpressure for better flushing performance. You can tune the flush threshold with the BRAINTRUST_FLUSH_BACKPRESSURE_BYTES environment variable. See Tune performance for all available configuration options.
When the task function throws, the C# eval framework catches the exception, records it on the task span and root span (with ActivityStatusCode.Error), and calls ScoreForTaskException on every scorer instead of Score. The eval continues; no cases are skipped. By default, ScoreForTaskException returns a single score of 0.0. Override it on your IScorer to return a custom fallback score, return an empty list to omit scoring for that case, or re-throw to abort the eval.
using Braintrust.Sdk.Eval;

sealed class MyScorer : IScorer<string, string>
{
    public string Name => "my_scorer";

    public Task<IReadOnlyList<Score>> Score(TaskResult<string, string> taskResult)
    {
        var matches = taskResult.Result == taskResult.DatasetCase.Expected;
        return Task.FromResult<IReadOnlyList<Score>>([new Score(Name, matches ? 1.0 : 0.0)]);
    }

    // Called instead of Score() when the task function threw.
    // Return [] to skip recording a score; throw to abort the eval.
    public Task<IReadOnlyList<Score>> ScoreForTaskException(
        Exception taskException,
        DatasetCase<string, string> datasetCase)
    {
        // Distinguish between expected and unexpected failures
        if (taskException is TimeoutException)
            return Task.FromResult<IReadOnlyList<Score>>([new Score(Name, 0.0)]);

        return Task.FromResult<IReadOnlyList<Score>>([]); // skip scoring
    }
}
The task span and root eval span both receive an OTel exception event with exception.type, exception.message, and exception.stacktrace attributes, visible in any OTel-compatible backend connected to Braintrust.
When a scorer's Score method throws, the exception is recorded on that scorer's span (with ActivityStatusCode.Error and an OTel exception event) and ScoreForScorerException is called as a fallback. Other scorers continue running unaffected. By default, ScoreForScorerException returns a single score of 0.0. Override it to return a custom fallback, return an empty list to omit the score, or re-throw to abort the eval.
using Braintrust.Sdk.Eval;

sealed class MyScorer : IScorer<string, string>
{
    public string Name => "my_scorer";

    public Task<IReadOnlyList<Score>> Score(TaskResult<string, string> taskResult)
    {
        // ... scoring logic that might throw
        throw new InvalidOperationException("unexpected output format");
    }

    // Called when Score() throws. Other scorers are not affected.
    // Return [] to skip recording a score; throw to abort the eval.
    public Task<IReadOnlyList<Score>> ScoreForScorerException(
        Exception scorerException,
        TaskResult<string, string> taskResult)
    {
        return Task.FromResult<IReadOnlyList<Score>>([new Score(Name, 0.0)]);
    }
}
Score spans are named score:<scorer_name> (e.g. score:my_scorer), making individual scorer traces distinguishable in Braintrust and any connected OTel backend.

Next steps