Run evaluations directly in your code using the Eval() function, or use the braintrust eval CLI command to run multiple evaluations from files. Integrate with CI/CD to catch regressions automatically.

Run with Eval()

The Eval() function runs an evaluation and creates an experiment:
import { Eval, initDataset } from "braintrust";
import { Factuality } from "autoevals";

Eval("My Project", {
  data: initDataset("My Project", { dataset: "My Dataset" }),
  task: async (input) => {
    // Your LLM call here
    return await callModel(input);
  },
  scores: [Factuality],
  metadata: {
    model: "gpt-4o",
    temperature: 0.7,
  },
});
Running Eval() automatically:
  • Creates an experiment in Braintrust
  • Displays a summary in your terminal
  • Populates the UI with results
  • Returns summary metrics that you can capture in code (see the example below)
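
If you want the metrics programmatically, capture the value that Eval() resolves to. This is a minimal sketch; the exact shape of the returned object (assumed here to expose a summary field) may vary by SDK version:
import { Eval } from "braintrust";
import { Factuality } from "autoevals";

// Capture the result instead of discarding it (field names are assumptions)
const result = await Eval("My Project", {
  data: () => [{ input: "What country is Paris in?", expected: "France" }],
  task: async (input) => {
    // Replace with your LLM call; a constant answer keeps the sketch self-contained
    return "France";
  },
  scores: [Factuality],
});

// Inspect the summary metrics reported for the experiment
console.log(result.summary);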

Run with CLI

Use the braintrust eval command to run evaluations from files:
npx braintrust eval basic.eval.ts
npx braintrust eval [file or directory] ...
The CLI loads environment variables from the following files (an example .env follows the list):
  • .env.development.local
  • .env.local
  • .env.development
  • .env
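
For example, a minimal .env file might contain your Braintrust API key plus any provider keys your task or scorers need (OPENAI_API_KEY is shown only as an illustration):
# .env — loaded automatically by the braintrust eval CLI
BRAINTRUST_API_KEY=<your-braintrust-api-key>
# Illustrative: a provider key used by your task or by LLM-based scorers such as Factuality
OPENAI_API_KEY=<your-openai-api-key>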

Watch mode

Re-run evaluations automatically when files change:
npx braintrust eval --watch basic.eval.ts
braintrust eval --watch eval_basic.py

Local testing mode

Run evaluations without sending logs to Braintrust for quick iteration:
npx braintrust eval --no-send-logs basic.eval.ts
braintrust eval --no-send-logs eval_basic.py

Run in CI/CD

Integrate evaluations into your CI/CD pipeline to catch regressions automatically.

GitHub Actions

Use the braintrustdata/eval-action to run evaluations on every pull request:
- name: Run Evals
  uses: braintrustdata/eval-action@v1
  with:
    api_key: ${{ secrets.BRAINTRUST_API_KEY }}
    runtime: node
The action automatically posts a comment with the results to the pull request.

Full example workflow:
name: Run evaluations

on:
  pull_request:
    branches: [main]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Set up Node.js
        uses: actions/setup-node@v3
        with:
          node-version: '18'

      - name: Install dependencies
        run: npm install

      - name: Run Evals
        uses: braintrustdata/eval-action@v1
        with:
          api_key: ${{ secrets.BRAINTRUST_API_KEY }}
          runtime: node

Other CI systems

For other CI systems, run evaluations as a standard command:
# Install dependencies
npm install

# Run evaluations
npx braintrust eval evals/
Ensure your CI environment has the BRAINTRUST_API_KEY environment variable set.
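
For instance, a GitLab CI job might look like the following sketch (the job name, image, and directory layout are illustrative; provide BRAINTRUST_API_KEY as a masked CI/CD variable rather than hard-coding it):
# .gitlab-ci.yml (illustrative)
evaluate:
  image: node:20
  script:
    - npm install
    - npx braintrust eval evals/
  # BRAINTRUST_API_KEY is supplied to the job via the project's CI/CD variables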

Run remotely

Expose evaluations running on remote servers or local machines using dev mode:
npx braintrust eval --dev basic.eval.ts
This allows you to trigger evaluations from the Braintrust UI or API while the code runs in your environment. See Run remote evaluations for details.

Configure experiments

Customize experiment behavior with options:
Eval("My Project", {
  data: myDataset,
  task: myTask,
  scores: [Factuality],

  // Experiment name
  experiment: "gpt-4o-experiment",

  // Metadata for filtering/analysis
  metadata: {
    model: "gpt-4o",
    prompt_version: "v2",
  },

  // Maximum concurrency
  maxConcurrency: 10,

  // Trial count for averaging
  trialCount: 3,
});

Run trials

Run each input multiple times to measure variance and get more robust scores. Braintrust intelligently aggregates results by bucketing test cases with the same input value:
Eval("My Project", {
  data: myDataset,
  task: myTask,
  scores: [Factuality],
  trialCount: 10, // Run each input 10 times
});

Run local evals without sending logs

Run evaluations locally without creating experiments or sending data to Braintrust:
Eval(
  "Say Hi Bot",
  {
    data: () => [{ input: "David", expected: "Hi David" }],
    task: (input) => "Hi " + input,
    scores: [Factuality],
  },
  {
    noSendLogs: true, // Run locally without creating experiment
  },
);
This is equivalent to passing the --no-send-logs flag with the CLI command.

Next steps