AI testing

Catch regressions before they reach production

Write evals, run them in CI, and know exactly what regressed and what improved.

Free to start · No credit card · 5-min setup

[Product screenshot: the "Customer support bot" playground, comparing a base task (system prompt: "You are a helpful customer service assistant for an e-commerce platform. Be friendly, concise, and professional." over a {{input}} user message) against two comparison tasks on the same {{input}}.]

Trusted by AI teams at Notion, Dropbox, Zapier, and Coursera

From zero to a passing eval suite in minutes

Real test cases. Real scorers. Regressions caught before the merge.

1
# Start with 10 examples
data = [
  {
    "input": "Summarize this",
    "expected": "...",
  },
  # add from prod traces
]
Write your evals
Build a dataset of inputs and expected outputs. Start with 10 real examples. Pull more from production traces as you find edge cases.
2
from braintrust import Eval
from autoevals import Factuality

Eval(
  "Customer support",
  data=data,
  task=my_llm,
  scores=[Factuality],
  threshold=0.85,
)
Define what “passing” means
Set a score threshold. Braintrust runs every eval, scores each output, and tells you exactly what passed, failed, or regressed versus last time.
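Under the hood, an eval run is a simple loop: run the task on each input, score the output, and check the score against the threshold. A minimal sketch of that loop in plain Python (`run_task` and `score` here are hypothetical stand-ins, not Braintrust APIs):

```python
def run_eval(data, run_task, score, threshold=0.85):
    """Run each test case, score the output, and mark pass/fail per case."""
    results = []
    for case in data:
        output = run_task(case["input"])
        s = score(output, case["expected"])
        results.append(
            {"input": case["input"], "score": s, "passed": s >= threshold}
        )
    return results

# Toy usage with an exact-match scorer standing in for Factuality:
data = [{"input": "2+2", "expected": "4"}]
results = run_eval(
    data,
    run_task=lambda x: "4",
    score=lambda out, exp: 1.0 if out == exp else 0.0,
)
```

Braintrust adds the parts the sketch leaves out: persistence of every run, per-case diffs against the previous baseline, and the pass/fail rollup in CI.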
3
- name: AI regression tests
  uses: braintrustdata/eval-action@v1
  with:
    api_key: ${{ secrets.BRAINTRUST_API_KEY }}
    fail_on_regression: true
Run in CI like any other test
Every PR runs the full eval suite. Merges block when scores drop. Same workflow as unit tests but built for LLMs.
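The gating logic amounts to comparing each scorer's result on the PR against the baseline run and failing the check on any drop. An illustrative sketch, not the eval-action's actual implementation:

```python
def check_regression(baseline_scores, new_scores, tolerance=0.0):
    """Return the names of scorers whose score dropped below the baseline."""
    regressions = []
    for name, baseline in baseline_scores.items():
        new = new_scores.get(name, 0.0)
        if new < baseline - tolerance:
            regressions.append(name)
    return regressions

# A drop on Factuality would block the merge:
regressed = check_regression({"Factuality": 0.91}, {"Factuality": 0.84})
```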

What changes when Braintrust is part of your workflow

10x

Faster issue resolution

<10 min

Eval turnaround

25%

Accuracy improvement

45x

More feedback

Notion, Dropbox, Zapier, and Coursera use Braintrust to ship better AI. Get started free →

Works how your team works

Engineers write evals in code and run them in CI. PMs and domain experts review results and curate test cases in the UI.

For engineers

import { Eval } from "braintrust";
import { Factuality } from "autoevals";

Eval("Customer support", {
  data: () => testCases,
  task: async (input) =>
    myLLM(input),
  scores: [Factuality],
  threshold: 0.85,
});

Write evals in any language. Version-controlled, parallelized, and composable with your existing test suite. Run locally or in CI with one command.

For PMs & domain experts

[Screenshot: Braintrust human review panel with classification and relevancy scoring.]

Review failing and regressed cases in the UI without touching code. Add new test cases from production examples with one click.

Built for AI testing from the start

Testing that actually scales

Write one eval function and run it against thousands of test cases in parallel. Braintrust handles scheduling, storage, and comparison against every previous run.
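The fan-out described above can be sketched with a thread pool: one task function, mapped across all cases concurrently. Illustrative only; Braintrust's own scheduler replaces this and adds storage and run-over-run comparison:

```python
from concurrent.futures import ThreadPoolExecutor

def run_parallel(task, cases, max_workers=8):
    """Apply one task function to every test case concurrently, preserving order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(task, cases))

outputs = run_parallel(lambda x: x.upper(), ["a", "b", "c"])
```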

Regression testing in CI

Every PR runs your full eval suite. Scores are compared against the baseline automatically. Merges block when quality drops.

Test cases from real failures

When something breaks in production, add it to a dataset and it becomes a test case. Your eval suite grows from actual bugs.
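That failure-to-test-case loop is a simple append: capture the bad input, pair it with the corrected output, and add it to the dataset. A local sketch, with a JSON file (the hypothetical `eval_dataset.json`) standing in for a Braintrust dataset:

```python
import json
from pathlib import Path

def add_failure_to_dataset(path, failed_input, corrected_output):
    """Append a production failure as a new test case in a JSON dataset."""
    p = Path(path)
    cases = json.loads(p.read_text()) if p.exists() else []
    cases.append({"input": failed_input, "expected": corrected_output})
    p.write_text(json.dumps(cases, indent=2))
    return len(cases)

n = add_failure_to_dataset(
    "eval_dataset.json",
    "Where is my refund?",
    "Refunds post within 5-7 business days.",
)
```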

What our customers say

“Braintrust helped us identify several patterns that we wouldn't have found otherwise.”

Luis Héctor Chávez, CTO at Replit

Stop shipping on vibes

Set up your first eval in minutes

Free to start · No credit card required

Get started free