Vitest is a test runner for JavaScript and TypeScript. Braintrust integrates with Vitest to run your tests as evals: Braintrust creates an experiment from each test suite, and each test becomes a traced span.

Setup

Install the braintrust package alongside Vitest:
npm install braintrust
Set your API key as an environment variable:
export BRAINTRUST_API_KEY=<your-api-key>

Separate evals from unit tests

This is optional — eval files are just regular Vitest files and can live anywhere in your project. But evals run slower than unit tests and log results to Braintrust, so a common convention is a .eval.ts suffix or a dedicated evals/ directory, with a separate Vitest config that targets them:
vitest.eval.config.ts
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    include: ["**/*.eval.ts"],
    testTimeout: 30000,  // LLM calls need more time than typical unit tests
  },
});
Then run evals separately from your unit tests:
# Unit tests (fast, no LLM calls)
npx vitest run

# Evals (slower, logs to Braintrust)
npx vitest run --config vitest.eval.config.ts
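If you want shorter commands, you can wire both invocations into package.json scripts (the script names here are just a convention, not something Braintrust requires):

```json
{
  "scripts": {
    "test": "vitest run",
    "eval": "vitest run --config vitest.eval.config.ts"
  }
}
```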

Run your first eval

Call wrapVitest once at the top of your test file, passing in the Vitest globals. Use the returned object in place of the standard test, describe, and expect.
my-eval.eval.ts
import * as vitest from "vitest";
import { wrapVitest } from "braintrust";

const { test, expect, describe, afterAll } = wrapVitest(
  vitest,
  { projectName: "my-project" },
);

describe("My eval suite", () => {
  test(
    "basic check",
    {
      input: { prompt: "What is 1 + 1?" },
      expected: "2",
    },
    async ({ input, expected }) => {
      const output = await myModel(input.prompt);
      expect(output).toBe(expected);
      return output;
    },
  );
});
Run it with the eval config:
npx vitest run --config vitest.eval.config.ts
After the suite finishes, a summary prints to your terminal and results are available in Braintrust.

Example: evaluating an LLM summarizer

This example evaluates a function that summarizes news articles. It loads a dataset from Braintrust, attaches multiple scorers, and logs structured outputs per test case.
summarizer.eval.ts
import * as vitest from "vitest";
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";
import { wrapVitest, initDataset } from "braintrust";
import { Factuality, EmbeddingSimilarity } from "autoevals";
import { summarize } from "../src/summarizer";

const { test, expect, describe, afterAll, logOutputs } = wrapVitest(
  vitest,
  { projectName: "news-summarizer" },
);

// Load dataset
const articles = await initDataset({
  project: "news-summarizer",
  dataset: "articles-v2",
}).fetchedData();

// Custom scorer: score 1 if the output contains the expected string
function containsExpected({
  output,
  expected,
}: {
  output: unknown;
  expected?: unknown;
}) {
  return {
    name: "contains_expected",
    score:
      typeof output === "string" &&
      typeof expected === "string" &&
      output.includes(expected)
        ? 1.0
        : 0.0,
  };
}

describe("Summarizer quality", () => {
  // `data` fans out into one traced test case per record
  test(
    "summarizes accurately",
    {
      data: articles,
      scorers: [
        Factuality,
        EmbeddingSimilarity,

        // Custom scorer that penalizes overly long summaries
        ({ output, input }) => ({
          name: "brevity",
          score: output.length <= input.article.length * 0.3 ? 1 : 0.5,
          metadata: {
            output_chars: output.length,
            input_chars: input.article.length,
          },
        }),
      ],
    },
    async ({ input }) => {
      const summary = await summarize(input.article);

      // Log additional fields to Braintrust
      logOutputs({ summary, test: "test-key" });

      // Named expects will log a score for each assertion
      expect(summary, "not_empty").not.toBe("");

      // Return value is passed to scorers as `output`
      return summary;
    },
  );

  test(
    "math with scorer",
    {
      input: { a: 10, b: 3 },
      expected: 7,
      scorers: [
        ({ output, expected }) => ({
          name: "exact_match",
          score: output === expected ? 1.0 : 0.0,
        }),
      ],
    },
    async ({ input, expected }) => {
      const result = input.a - input.b;
      expect(result, "result").toBe(expected);
      return result;
    },
  );

  test(
    "translation",
    {
      input: { task: "Translate 'hello' to Spanish" },
      expected: "hola",
      tags: ["language", "spanish"],
      scorers: [containsExpected],
    },
    async ({ input, expected }) => {
      const { text } = await generateText({
        model: openai("gpt-5-mini"),
        prompt: input.task,
      });
      expect(text.toLowerCase(), "translation").toContain(expected);
      return text.toLowerCase();
    },
  );
});


Terminal output


Braintrust Experiment Summary
──────────────────────────────
Experiment: summarizer-quality-2025-02-27
Project:    news-summarizer

Scores
  Factuality          0.87
  EmbeddingSimilarity 0.91
  brevity             0.73
  not_empty           1.00
  pass                0.67

View results: https://www.braintrust.dev/...

Key concepts

wrapVitest

Wraps Vitest’s test, describe, and expect with Braintrust tracking.
import * as vitest from "vitest";
const { test, expect, describe, afterAll } = wrapVitest(
  vitest,
  {
    projectName: "my-project",  // optional: groups experiments under a project
    displaySummary: true,       // optional: print summary after tests (default: true)
  },
);

Experiments and suites

Each describe creates one Braintrust experiment. Braintrust appends a timestamp to make each run unique. The project groups experiments together and defaults to the suite name if projectName is not set in the config. Results are pushed to Braintrust regardless of whether individual tests pass or fail, so every run is recorded.

Test configuration

test accepts an optional config object between the name and the test function:
test(
  "test name",
  {
    input: any,          // logged as span input
    expected: any,       // logged as span expected
    metadata: object,    // extra fields logged to the span
    tags: string[],      // searchable labels in Braintrust
    scorers: Scorer[],   // automatically score the return value
    data: Record[],      // expand into one test per record
  },
  async ({ input, expected, metadata }) => {
    return myFunction(input);
  },
);

Scorers

A scorer is any function that receives { output, expected, input, metadata } and returns a name and score:
const myScorer = ({ output, expected }) => ({
  name: "exact_match",
  score: output === expected ? 1 : 0,
});
Scorers run after each test, even on failure. Errors inside scorers are caught and logged.
import { Factuality, Levenshtein, EmbeddingSimilarity } from "autoevals";

test("quality", { scorers: [Factuality, Levenshtein] }, async ({ input }) => {
  return await myModel(input.prompt);
});
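Because a scorer is just a pure function of its arguments, you can unit-test scorers on their own before wiring them into a suite. A minimal self-contained sketch (the type names below are illustrative, not SDK exports):

```typescript
// Minimal types matching the scorer contract described above
type ScorerArgs = {
  output: unknown;
  expected?: unknown;
  input?: unknown;
  metadata?: Record<string, unknown>;
};

type Score = { name: string; score: number; metadata?: Record<string, unknown> };

// Same exact_match scorer as above, with explicit types
const exactMatch = ({ output, expected }: ScorerArgs): Score => ({
  name: "exact_match",
  score: output === expected ? 1 : 0,
});

console.log(exactMatch({ output: "hola", expected: "hola" }).score); // 1
console.log(exactMatch({ output: "hi", expected: "hola" }).score); // 0
```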

Logging helpers

Use logOutputs and logFeedback inside a test to log additional data to the current span:
logOutputs({ summary, tokens_used: 412 });

Inline data and dataset support

Define data inline:
test(
  "sentiment",
  {
    data: [
      { input: "great product!", expected: "positive" },
      { input: "terrible experience", expected: "negative" },
    ],
    scorers: [({ output, expected }) => ({ name: "accuracy", score: output === expected ? 1 : 0 })],
  },
  async ({ input }) => classifySentiment(input),
);
Or load from a managed Braintrust dataset:
const data = await initDataset({
  project: "my-project",
  dataset: "my-dataset",
}).fetchedData();

test("eval", { data, scorers: [Factuality] }, async ({ input }) => {
  return myModel(input.prompt);
});
Both approaches expand automatically into one test case, and one Braintrust span, per record.
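Conceptually, the fan-out is a map from records to test invocations: each record supplies its own input and expected to one traced case. A plain-TypeScript sketch of the idea (classifySentiment here is a toy stand-in, not the SDK's internals):

```typescript
type Example = { input: string; expected: string };

// Toy stand-in for the function under test
const classifySentiment = (input: string): string =>
  input.includes("great") ? "positive" : "negative";

const data: Example[] = [
  { input: "great product!", expected: "positive" },
  { input: "terrible experience", expected: "negative" },
];

// One test case (and one span) per record
const results = data.map(({ input, expected }) => {
  const output = classifySentiment(input);
  return { input, output, pass: output === expected };
});

console.log(results.every((r) => r.pass)); // true
```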

Resources