Score every agent step, every run.

Every tool call and reasoning step is logged as a span. Score them, compare experiments, and find exactly which step caused a regression.

Free to start · No credit card · 5-min setup

Eval run · experiment-18 · 3.21s total · $0.009
research_agent · task · 3.21s
├─ search_web · tool · 0.61s · 8 results · query: "Q3 earnings"
├─ read_document · tool · 0.38s · 2,841 tokens extracted
├─ chat_completion · llm · 1.94s · gpt-4o · 3,102 tokens · $0.009
├─ Factuality · score · 0.96 · pass
└─ Task completion · score · 0.91 · pass

search 0.61s · read 0.38s · llm 1.94s · overhead 0.28s

Works with your stack. 50+ integrations, including:

OpenAI
Anthropic
Google
Meta
Mistral
xAI
OpenTelemetry
LangChain
CrewAI
Vercel AI SDK
LlamaIndex
Mastra

Run evals in your existing workflow.

Run agent evals from code or the UI. Iterate in the playground without touching code.

Experiments view showing agent runs and scores

Code, CLI, or UI. Your call

Define your agent task and test cases in code, run from the terminal, or build evals in the UI. Every span is logged automatically.
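As a rough illustration of what "agent task and test cases in code" can look like, here is a minimal, library-free sketch. Everything in it (`Case`, `exact_match`, `run_eval`, `toy_agent`) is a hypothetical name chosen for this example, not the Braintrust SDK; a real setup would swap in the SDK's own eval entry point and a real agent.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    """One test case: an input for the agent and the answer we expect."""
    input: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Return 1.0 when output equals expected, ignoring case and outer whitespace."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(task: Callable[[str], str], cases: list[Case],
             scorer: Callable[[str, str], float] = exact_match) -> dict:
    """Run the task on every case, score each output, and report the mean score."""
    rows = []
    for case in cases:
        output = task(case.input)
        rows.append({"input": case.input, "output": output,
                     "score": scorer(output, case.expected)})
    mean = sum(r["score"] for r in rows) / len(rows) if rows else 0.0
    return {"rows": rows, "mean_score": mean}


def toy_agent(question: str) -> str:
    """Stand-in for a real agent; a real task would call a model and tools."""
    answers = {"capital of france?": "Paris"}
    return answers.get(question.lower(), "unknown")


CASES = [Case("Capital of France?", "Paris"), Case("Capital of Peru?", "Lima")]
result = run_eval(toy_agent, CASES)
print(result["mean_score"])
```

The shape is the point here: a task function, a list of cases, and one or more scorers, so the same definition can run from a terminal or a CI job.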

Playground showing agent prompt editing with dataset rows

Iterate fast without touching code

Adjust prompts, swap models, and replay your test cases in the playground. Iterations stay linked to your experiments.

Real results from real teams.

<24 hrs

To deploy a new frontier model

<10 min

Eval turnaround

50% → 90%+

Accuracy improvement

45x

More feedback

Notion, Dropbox, Zapier, and Coursera use Braintrust to ship better AI.

Evaluation built for AI agents.

Every step as a span

Tool calls and retrieval steps nest as child spans. Open any run and see the full decision path across services and agent frameworks (LangGraph, CrewAI, AutoGen, and more).
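To make the nesting concrete, here is a small sketch of how child spans can be recorded with a context manager. The `Tracer` and `Span` classes are assumptions written for this example (they are not Braintrust's or OpenTelemetry's API), and the span names are borrowed from the sample run at the top of the page.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One step in a run: a name, a kind (task, tool, llm), and nested children."""
    name: str
    kind: str
    duration_s: float = 0.0
    children: list = field(default_factory=list)


class Tracer:
    """Hypothetical in-process tracer that nests spans by call order."""

    def __init__(self) -> None:
        self.root: Optional[Span] = None
        self._stack: list = []

    @contextmanager
    def span(self, name: str, kind: str):
        node = Span(name, kind)
        if self._stack:
            # Attach to whichever span is currently open.
            self._stack[-1].children.append(node)
        else:
            self.root = node
        self._stack.append(node)
        start = time.perf_counter()
        try:
            yield node
        finally:
            node.duration_s = time.perf_counter() - start
            self._stack.pop()


def render(span: Span, depth: int = 0) -> list:
    """Return the span tree as indented lines, parent first."""
    lines = ["  " * depth + f"{span.name} ({span.kind})"]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


tracer = Tracer()
with tracer.span("research_agent", "task"):
    with tracer.span("search_web", "tool"):
        pass  # a real step would call the search tool here
    with tracer.span("read_document", "tool"):
        pass
    with tracer.span("chat_completion", "llm"):
        pass

print("\n".join(render(tracer.root)))
```

Because each span attaches to whatever span is open when it starts, the tree mirrors the order in which the agent actually made its calls.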

Trace-level scoring and experiment diffs

Score at the trace level: factuality, task completion, tool-use accuracy, groundedness. Compare experiments and see exactly which step caused a regression.
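One way to think about an experiment diff: line up the per-step scores from two runs and look for the biggest drop. The sketch below assumes each experiment is reduced to a plain dict of step name to score; the function name `largest_regression` and all the numbers are made up for illustration and do not come from a real run.

```python
from typing import Optional


def largest_regression(baseline: dict[str, float],
                       candidate: dict[str, float]) -> Optional[tuple]:
    """Return (step, delta) for the step whose score dropped most, or None if nothing dropped."""
    # Only compare steps that were scored in both experiments.
    deltas = {step: candidate[step] - baseline[step]
              for step in baseline if step in candidate}
    if not deltas:
        return None
    worst = min(deltas, key=deltas.get)
    return (worst, deltas[worst]) if deltas[worst] < 0 else None


# Illustrative scores only (hypothetical values, not measured results).
baseline = {"search_web": 0.92, "read_document": 0.95, "chat_completion": 0.96}
candidate = {"search_web": 0.91, "read_document": 0.70, "chat_completion": 0.97}

hit = largest_regression(baseline, candidate)
if hit is not None:
    step, delta = hit
    print(f"largest drop: {step} ({delta:+.2f})")
```

In this toy data the final answer step actually improved, so a single end-to-end score could hide the fact that document reading got worse.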

Production traces become eval datasets

Tag a failing trace and it goes straight into a dataset. The format is the same in production and in evals. The traces you debug today are the tests you ship tomorrow.
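A minimal sketch of the trace-to-dataset idea, assuming traces are plain dicts with `id`, `input`, `output`, and `tags` keys. That schema, the `corrected_output` field, and both function names are assumptions made for this example, not a documented format.

```python
def trace_to_case(trace: dict) -> dict:
    """Turn one logged trace into a dataset row that an eval can replay."""
    return {
        "input": trace["input"],
        # A reviewer-supplied correction, if any; None means "no expected answer yet".
        "expected": trace.get("corrected_output"),
        "metadata": {
            "source_trace_id": trace["id"],
            "tags": list(trace.get("tags", [])),
        },
    }


def collect_failures(traces: list, tag: str = "failing") -> list:
    """Keep only traces carrying the given tag and convert them to dataset rows."""
    return [trace_to_case(t) for t in traces if tag in t.get("tags", [])]


# Hypothetical production traces, invented for this example.
traces = [
    {"id": "t-1", "input": "Summarize Q3 earnings", "output": "ok", "tags": []},
    {"id": "t-2", "input": "Find the refund policy", "output": "wrong page",
     "tags": ["failing"], "corrected_output": "Refunds within 30 days"},
]

dataset = collect_failures(traces)
print(dataset)
```

Keeping a pointer back to the source trace means a failing eval row can always be traced to the production run it came from.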

Customer spotlight

“Braintrust helps us ship AI agents customers actually trust.”

Mohsen Sardari, VP Engineering at Bill

Get a demo

Stop shipping agents on vibes

First agent eval live in minutes.

Free to start · No credit card required

Start free