CI/CD for AI

Ship AI knowing what actually improved

Run evals on every pull request, track LLM cost per experiment, and block merges when quality drops.

Free to start · No credit card · 5-min setup

[Interactive demo: live eval results comparing GPT 5.2, Claude 4.5 Opus, Gemini 3 Pro, and Mistral Large across metrics including score diff per edit, score diff, tool usage, and accuracy]

Trusted by AI teams at Notion, Dropbox, Zapier, Coursera, and more.

From zero to gated deploys in minutes

One action. Automatic scoring. Merges blocked when quality drops.

1. Write your dataset

data = [
  {
    "input": "Summarize this",
    "expected": "...",
  },
]

A dataset is inputs and expected outputs. Start with 10 examples and grow from there.
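Datasets like this are often kept as JSONL next to the code and loaded at eval time. A minimal sketch (the rows and in-memory file here are illustrative, not from the product):

```python
import io
import json

# Illustrative JSONL dataset: one {"input", "expected"} object per line.
# io.StringIO keeps the sketch self-contained; in practice you would
# open a file on disk instead.
raw = io.StringIO(
    '{"input": "Summarize this", "expected": "A short summary."}\n'
    '{"input": "Translate to French: hello", "expected": "bonjour"}\n'
)

data = [json.loads(line) for line in raw]
```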
2. Run Eval()

from braintrust import Eval
from autoevals import Factuality

Eval(
  "My LLM",
  data=data,
  task=my_llm,  # your function that calls the model
  scores=[Factuality],
)

Point it at your LLM and a scorer. Braintrust handles parallelization, storage, and versioning automatically.
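Conceptually, an eval run reduces to a simple loop: run the task over each dataset row, score the output, and aggregate. A plain-Python sketch of that shape (this is not the Braintrust SDK's internals; `run_eval` and `exact_match` are illustrative names):

```python
def exact_match(output, expected):
    """Toy scorer: 1.0 on exact match, else 0.0."""
    return 1.0 if output == expected else 0.0

def run_eval(data, task, scorers):
    """Run task on every row, score each output, and average per scorer."""
    rows = []
    for example in data:
        output = task(example["input"])
        scores = {s.__name__: s(output, example["expected"]) for s in scorers}
        rows.append({"input": example["input"], "output": output, "scores": scores})
    # Average each scorer's score across all rows.
    summary = {
        name: sum(r["scores"][name] for r in rows) / len(rows)
        for name in rows[0]["scores"]
    }
    return rows, summary

# Example with a stub "model" that upper-cases its input.
data = [
    {"input": "hi", "expected": "HI"},
    {"input": "ok", "expected": "no"},
]
rows, summary = run_eval(data, task=str.upper, scorers=[exact_match])
# summary averages to {"exact_match": 0.5}: one hit, one miss.
```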
3. Gate the PR

- name: Run evals
  uses: braintrustdata/eval-action@v1
  with:
    api_key: ${{ secrets.BRAINTRUST_API_KEY }}
    runtime: python

Add one step to your GitHub Actions workflow. Merges block automatically when scores drop below your threshold.

What changes when Braintrust is part of your workflow

10x faster issue resolution
<10 min eval turnaround
25% accuracy improvement
45x more feedback

Notion, Dropbox, Zapier, and Coursera use Braintrust to ship better AI. Get started free →

Works how your team works

Engineers write evals in code. CI runs them on every PR. Everyone sees scores, cost, and regressions in one place.

For engineers

- name: Run evals
  uses: braintrustdata/eval-action@v1
  with:
    api_key: ${{ secrets.BRAINTRUST_API_KEY }}
    runtime: python
    threshold: 0.85
    fail_on_regression: true

One step in your CI workflows. Evals run on every PR, scores post as a check, and the merge blocks if quality drops.

For leads & PMs

[Screenshot: Braintrust eval results table with scores]

Every experiment shows scores, cost, and latency against the baseline. No digging through logs to know if a model swap was worth it.
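The baseline comparison boils down to a per-metric delta. A toy sketch of that idea (not Braintrust's API; the numbers are made up):

```python
def compare(experiment, baseline):
    """Per-metric deltas vs. the baseline; positive = experiment is higher."""
    return {
        metric: round(experiment[metric] - baseline[metric], 4)
        for metric in baseline
    }

# Hypothetical runs: the candidate scores higher, costs less, runs slower.
baseline = {"score": 0.82, "cost_usd": 0.0140, "latency_s": 1.9}
candidate = {"score": 0.87, "cost_usd": 0.0121, "latency_s": 2.2}
deltas = compare(candidate, baseline)
# deltas: {"score": 0.05, "cost_usd": -0.0019, "latency_s": 0.3}
```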

Built for AI CI/CD from the start

Block merges on regression

Set score thresholds per experiment. When a PR drops below them, checks fail and the merge blocks. No manual review required.
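In CI terms, the gate reduces to comparing run scores against their thresholds and exiting nonzero when any fall short, which fails the check. A hypothetical sketch of that logic (plain Python, not the eval-action's implementation):

```python
def gate(scores, thresholds):
    """Return the metrics whose score fell below their threshold."""
    return [m for m, t in thresholds.items() if scores.get(m, 0.0) < t]

# Made-up run scores and per-experiment thresholds.
scores = {"factuality": 0.81, "accuracy": 0.93}
thresholds = {"factuality": 0.85, "accuracy": 0.90}

failing = gate(scores, thresholds)  # factuality is below its threshold
if failing:
    print(f"Gate failed: {failing}")
    # In CI you would exit nonzero here so the check (and the merge) blocks:
    # raise SystemExit(1)
```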

LLM cost tracking per experiment

Every eval logs token usage and cost automatically. Compare cost across model versions and prompt changes before anything reaches production.
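Token-based cost is just usage times a per-token rate, which is what makes cross-model comparison mechanical. An illustrative calculation (the model names and prices below are made-up placeholders, not real pricing):

```python
# Hypothetical USD prices per 1M tokens, split by input/output.
PRICES_PER_MTOK = {
    "model-a": {"input": 3.00, "output": 15.00},
    "model-b": {"input": 0.50, "output": 1.50},
}

def run_cost(model, input_tokens, output_tokens):
    """Cost of one eval run: tokens times the per-million-token rate."""
    p = PRICES_PER_MTOK[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# The same eval workload priced on two models:
cost_a = run_cost("model-a", input_tokens=120_000, output_tokens=30_000)
cost_b = run_cost("model-b", input_tokens=120_000, output_tokens=30_000)
```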

Diff any two experiments

Pick any two runs and see exactly which inputs got better or worse, side by side. Know what changed before you ship it.
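A diff like this amounts to lining up per-input scores from two runs and bucketing what moved. A minimal sketch of the idea (illustrative, not the Braintrust UI's implementation):

```python
def diff_runs(before, after):
    """before/after map input -> score; return (improved, regressed) inputs."""
    improved = [k for k in before if after.get(k, 0.0) > before[k]]
    regressed = [k for k in before if after.get(k, 0.0) < before[k]]
    return improved, regressed

# Made-up per-input scores from two experiment runs.
before = {"q1": 0.6, "q2": 1.0, "q3": 0.4}
after = {"q1": 0.9, "q2": 0.7, "q3": 0.4}

improved, regressed = diff_runs(before, after)
# q1 got better, q2 got worse, q3 is unchanged.
```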

What our customers say

“We can run hundreds to thousands of experiments with Braintrust.”

Josh Clemm, VP of Engineering at Dropbox

Stop shipping on vibes

Set up your first eval in minutes

Free to start · No credit card required

Get started free