AI agent evaluation

Score tool calls, reasoning steps, and outputs across every agent run

Every tool call and reasoning step logs as a span. Score them, compare experiments, and find exactly which step caused the regression.

Free to start · No credit card · 5-min setup

Evaluating for a team? Talk to us →

Eval run · experiment-18 · 3.21s total · $0.009

research_agent (task) · 3.21s
├─ search_web (tool) · 0.61s · 8 results · query: "Q3 earnings"
├─ read_document (tool) · 0.38s · 2,841 tokens extracted
├─ chat_completion (llm) · 1.94s · gpt-4o · 3,102 tokens · $0.009
├─ Factuality (score) · 0.96 · pass
└─ Task completion (score) · 0.91 · pass

search 0.61s · read 0.38s · llm 1.94s · overhead 0.28s

Trusted by AI teams


Built around your eval workflow

Run agent evals from code or the UI, then iterate in the playground with no code changes.

Experiments view showing agent runs and scores

Run evaluations with Eval(), CLI, or UI

Define your agent task and test cases in code, run from the terminal, or build evals in the UI. Every span logs automatically.
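For example, a minimal Python eval might look like the sketch below. The agent stub, dataset row, and names are illustrative; Factuality is one of the built-in autoevals scorers.

```python
from braintrust import Eval
from autoevals import Factuality

def research_agent(input: str) -> str:
    # Stand-in for your real agent; replace with your own logic.
    return f"Answer for: {input}"

Eval(
    "research-agent",  # project name (illustrative)
    data=lambda: [
        {"input": "Summarize the Q3 earnings call", "expected": "Revenue grew year over year."},
    ],
    task=research_agent,
    scores=[Factuality],
)
```

Run the file with `braintrust eval` from the terminal, or build the same eval in the UI; either way the experiment and its spans land in the Experiments view.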

Playground showing agent prompt editing with dataset rows

Use playgrounds for rapid iteration

Adjust prompts, swap models, and replay your test cases without touching code. Iterations stay linked to your experiments.

Works with

LangChain · LangGraph · CrewAI · AutoGen · OpenAI Agents SDK · Google ADK · Vercel AI SDK · Mastra · Pydantic AI · LlamaIndex · Temporal · OpenTelemetry · + more

Sign up free and we'll give you instructions to get set up with your stack in minutes.

From teams using Braintrust

<24 hrs to deploy a new frontier model

<10 min eval turnaround

50% → 90%+ accuracy improvement

45x more feedback

Notion, Dropbox, Zapier, and Coursera use Braintrust to ship better AI.

Built for agentic evaluation

Every step as a span

Tool calls and retrieval steps nest as child spans. Open any run and see the full decision path across frameworks (LangGraph, CrewAI, AutoGen, and more).
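As a rough sketch of what that looks like with the Python SDK (the project name, tool, and agent function here are illustrative):

```python
from braintrust import init_logger, traced, wrap_openai
from openai import OpenAI

logger = init_logger(project="research-agent")  # illustrative project name
client = wrap_openai(OpenAI())  # wrapped client logs every LLM call as a span

@traced  # each decorated call becomes a span, nested under its caller
def search_web(query: str) -> list[str]:
    return ["result 1", "result 2"]  # stand-in for a real search tool

@traced
def research_agent(question: str) -> str:
    results = search_web(question)  # shows up as a child span of research_agent
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"{question}\n\nContext: {results}"}],
    )
    return resp.choices[0].message.content
```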

Trace-level scoring and experiment diffs

Score at the trace level: factuality, task completion, tool use accuracy, groundedness. Compare experiments and see exactly which step caused the regression.
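Scorers can be the built-in autoevals ones or plain functions that return a score between 0 and 1. A sketch of a custom scorer, with a deliberately toy heuristic:

```python
from autoevals import Factuality

def task_completion(input, output, expected=None):
    # Toy heuristic, purely illustrative: count the answer as complete if it
    # is non-empty and mentions the first word of the question. In practice
    # you would encode your own rubric or use an LLM-as-judge scorer.
    completed = bool(output) and input.split()[0].lower() in str(output).lower()
    return 1.0 if completed else 0.0

# Mix built-in and custom scorers, e.g. scores=[Factuality, task_completion]
```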

Production traces become eval datasets

Tag a failing trace and it goes straight into a dataset. The format is the same in production and in evals. The traces you debug today are the tests you ship tomorrow.
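Pointing an eval at one of those datasets is a one-line change; a sketch assuming a hypothetical dataset named failing-traces:

```python
from braintrust import Eval, init_dataset
from autoevals import Factuality

def research_agent(input: str) -> str:
    return f"Answer for: {input}"  # stand-in for your real agent

Eval(
    "research-agent",
    # Pull the production cases you tagged; project and dataset names are illustrative.
    data=init_dataset(project="research-agent", name="failing-traces"),
    task=research_agent,
    scores=[Factuality],
)
```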

Customer spotlight

“Braintrust helps us ship AI agents customers actually trust.”

Mohsen Sardari, VP Engineering at Bill

Talk to us about agent evaluation →

Stop shipping agents on vibes

Set up your first agent eval in minutes

Free to start · No credit card required

Get started free