
How to eval: The Braintrust way

27 October 2025 · Braintrust Team

Building AI products feels different from traditional software development. When you tweak a prompt, swap a model, or adjust your retrieval logic, do you actually know if you made things better or worse? For most teams, the honest answer is "not really." The broken feedback loop in AI development turns engineering into guesswork.

This creates a painful reality: teams ship AI on vibes. They deploy changes hoping for improvement, only to discover regressions when users complain. A poorly tuned agent chooses the wrong tool at a critical moment. Hallucinations slip through. Quality degrades silently. Measuring AI quality isn't like traditional software testing: without systematic evaluation, these issues surface only in production.

The challenge boils down to a simple question: How do you know if your change improved things?

From vibes to verified

The Braintrust answer: Turn every production trace into measurable improvement through the complete development loop. Evaluation isn't overhead—it's infrastructure that enables fast iteration without breaking things.

The velocity paradox: Notion's 10x improvement

Working with Notion's AI team revealed something counterintuitive. The more rigorously they evaluated, the faster they shipped.

Before implementing systematic evaluation, Notion's approach to their Q&A feature was thorough but slow. Test datasets lived in JSONL files. Human evaluators manually scored outputs. Each change required careful review because the team couldn't quickly measure if they'd improved quality or introduced regressions. The result: 3 issues triaged and fixed per day.

Then, Notion built evaluation directly into their development loop with Braintrust. They created hundreds of datasets testing specific criteria like tool usage and factual accuracy. Automated scoring replaced manual review. Production traces fed directly into test cases.

The team's velocity increased to 30 issues per day—a 10x improvement.

The lesson is clear. Evaluation isn't overhead that slows you down—it's infrastructure that lets you move fast without breaking things. Notion didn't ship faster by working longer hours. They shipped faster because they could measure every change precisely.

The three building blocks

Every eval in Braintrust follows a consistent pattern:

  • Data: The test cases you run against your AI application
  • Task: The code that executes your AI logic
  • Scorers: The functions that measure quality

This scales from simple to complex use cases without adding conceptual overhead. Whether you're testing a single prompt or a multi-agent workflow, the structure remains the same.

typescript

import { Eval } from "braintrust";
import { Factuality } from "autoevals";

Eval("My AI Feature", {
  // Data: the test cases to run against your AI application
  data: () => [
    { input: "What's the capital of France?", expected: "Paris" },
    { input: "Largest planet?", expected: "Jupiter" },
  ],
  // Task: the code that executes your AI logic (myAIFunction is your own)
  task: async (input) => {
    return await myAIFunction(input);
  },
  // Scorers: the functions that measure quality
  scores: [Factuality],
});

Production to evaluation and back

Traditional AI development is fragmented. You build prompts in one place, test them somewhere else, monitor production in a third tool, and struggle to connect insights back to improvements. These disconnected steps kill velocity and prevent systematic improvement.

Braintrust connects the full cycle. Production logs capture every trace. Add any trace to a dataset with one click. Run offline evals to catch regressions before deployment. Ship changes that pass evaluation. Monitor with online scoring in production. Low-scoring traces feed back into your datasets. The cycle is continuous, measurable, and fast—every production trace becomes a test case, and every test case informs better production behavior.
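
For the logging side of the loop, a minimal sketch with the TypeScript SDK might look like the following. The project name, model choice, and generateResponse function are illustrative, and it assumes a BRAINTRUST_API_KEY in your environment.

typescript

import { initLogger, wrapOpenAI } from "braintrust";
import OpenAI from "openai";

// Send traces to a Braintrust project (name is illustrative).
// Assumes BRAINTRUST_API_KEY is set in the environment.
initLogger({ projectName: "customer-support" });

// Wrapping the client captures each LLM call as a trace automatically.
const client = wrapOpenAI(new OpenAI());

export async function generateResponse(question: string) {
  const completion = await client.chat.completions.create({
    model: "gpt-4o", // illustrative model choice
    messages: [{ role: "user", content: question }],
  });
  return completion.choices[0].message.content;
}

Every call made through the wrapped client becomes a trace you can inspect, tag, and add to a dataset.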

For product managers: Ship iterations in minutes

Product managers are often shut out of AI development loops. They rely on engineers to test changes, can't measure improvements themselves, and wait days for iteration cycles. This bottleneck slows down everyone.

Braintrust empowers PMs to make AI measurable and move fast without engineering dependencies. The platform is designed for both technical and non-technical teammates to work together in the same environment.

The PM workflow

PMs and engineers work in the same Braintrust environment, viewing the same data and discussing the same metrics. Measuring AI quality starts with replacing "feels better" with data-driven decisions.

Playgrounds let you compare prompt variations, swap models, and adjust parameters side-by-side on real examples. Run your modified prompt against 50 production traces and see exactly how scores shift. No code required.

The best test cases come from real user interactions. Braintrust logs capture every production trace, and you can add any trace to a dataset with one click. Spotted a failure pattern? Tag those examples and build a focused dataset to test the fix.

Human review where it matters

Human review captures subjective quality that automated scorers can't measure. Use keyboard shortcuts to label outputs, add contextual notes, and mark examples for further investigation. These reviews become first-class signals you can filter on and use to gate releases.

Loop accelerates your workflow

Loop is Braintrust's built-in AI agent that automates the time-intensive parts of eval development:

  • Generate datasets tailored to your use case
  • Optimize prompts by analyzing your current context and suggesting better-performing versions
  • Build scorers with custom rubrics that measure specific quality metrics
  • Find patterns in production logs and surface issues automatically

Once a change passes evaluation in the playground, deploy it directly. No waiting for engineering cycles. Prompts and datasets ship to production immediately, giving PMs the velocity to iterate as fast as the product demands.

Writing scorers: Product requirements for AI quality

Writing good scorers is product design, not an afterthought. Before you write a single line of eval code, define what "good" means for your feature. These criteria become your scorers.

Define success criteria upfront

Just like you wouldn't build a feature without clear requirements, you shouldn't build AI without clear success criteria. Start by asking:

  • What qualities do users expect in the output?
  • What would make this response helpful vs. unhelpful?
  • What are the failure modes I need to prevent?

Be specific. "The response should be accurate" is too vague. "The response must cite sources from the provided context and not hallucinate facts" is actionable. These criteria become your scorers.

Two types of scorers

Where possible, implement checks through code. Code-based scorers are reliable, execute quickly, and eliminate variability. Use them for deterministic checks like format validation, length constraints, and schema compliance.
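
As a sketch of what a code-based scorer can look like (the criteria here, valid JSON and a 2,000-character limit, are illustrative), a scorer is just a function that returns a name and a score between 0 and 1:

typescript

// A deterministic, code-based scorer. The checks below are illustrative.
function formatCheck({ output }: { output: string }) {
  let parses = false;
  try {
    JSON.parse(output);
    parses = true;
  } catch {
    // Output was not valid JSON.
  }
  const withinLimit = output.length <= 2000;
  return {
    name: "FormatCheck",
    score: (Number(parses) + Number(withinLimit)) / 2,
  };
}

// Pass it alongside other scorers, e.g. scores: [formatCheck, Factuality]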

For nuanced qualities that code can't capture—tone, creativity, empathy—use LLM-as-a-judge scorers. Design clear rubrics with explicit instructions and examples of good vs. bad outputs. Use chain of thought to understand scoring decisions. Choose the right model for judging, which may differ from your task model.

Braintrust provides multiple ways to create scorers. The autoevals library offers pre-built scorers for common scenarios like factuality. You can also create custom scorers directly in Braintrust—both code-based and LLM-as-a-judge—through the UI or SDK.
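
For the LLM-as-a-judge side, a minimal sketch using autoevals' LLMClassifierFromTemplate helper might look like the following; the rubric, answer choices, and score mapping are illustrative.

typescript

import { LLMClassifierFromTemplate } from "autoevals";

// An LLM-as-a-judge scorer built from a rubric (all values illustrative).
const Helpfulness = LLMClassifierFromTemplate({
  name: "Helpfulness",
  promptTemplate: `You are grading a support answer.
Question: {{input}}
Answer: {{output}}

Does the answer directly resolve the user's question?
(A) Fully resolves it
(B) Partially resolves it
(C) Does not resolve it`,
  choiceScores: { A: 1, B: 0.5, C: 0 },
  useCoT: true, // ask the judge to reason before choosing
});

// Used like any other scorer: scores: [Helpfulness]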

Iterate scorers as your product evolves

Scorer development is ongoing. After running initial evaluations, review low-score outputs to identify missing criteria or edge cases. Refine definitions, add new scorers, and rerun calibration on expanded datasets.

The tight coupling between development, evaluation, and refinement ensures scorers stay aligned with evolving product needs and user expectations.

Evaluating agents

Agent-based systems break tasks into multiple steps: planning, tool selection, execution, synthesis. Some agents are fully autonomous; others follow more structured workflows. Either way, you need to evaluate both end-to-end performance and individual steps.

The complexity challenge

Multi-step, multi-tool complexity means errors can surface at any point. When your agent chooses a tool, builds arguments, processes results, and synthesizes an answer, debugging requires visibility into every decision.

Key questions for agent evaluation:

  • Did the agent choose the right tool for the task?
  • Was the plan coherent and aligned with the user's goal?
  • Were tool arguments constructed correctly?
  • Did the agent properly utilize tool outputs?
  • Did it gracefully handle unexpected situations?

Offline agent evals

Offline agent evals function like unit tests or integration tests, emphasizing reproducibility. Create deterministic scenarios by stubbing external dependencies with production snapshots. Test specific agent actions in isolation. Assess individual steps including tool calls, parameter accuracy, and responses.
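
A single-step eval for tool selection could look something like this sketch, where selectTool, the tool names, and the test cases are hypothetical stand-ins for your own agent code:

typescript

import { Eval } from "braintrust";

// Hypothetical: selectTool is your agent's tool-selection step, called here in
// isolation with external dependencies stubbed out.
import { selectTool } from "./agent";

Eval("Agent: tool selection", {
  data: () => [
    { input: "What's the weather in Paris tomorrow?", expected: "weather_api" },
    { input: "Summarize my last three support tickets", expected: "ticket_search" },
  ],
  // Run only the tool-selection step, not the full agent loop.
  task: async (input) => {
    const decision = await selectTool(input);
    return decision.toolName;
  },
  // Deterministic check: did the agent pick the expected tool?
  scores: [
    ({ output, expected }) => ({
      name: "CorrectTool",
      score: output === expected ? 1 : 0,
    }),
  ],
});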

Use traces to capture every intermediate action. Each span in the trace represents a unit of work: an LLM call, a tool invocation, a retrieval step. Well-designed traces make it easy to pinpoint exactly where behavior diverged from expectations.
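
If you instrument agent steps yourself, a sketch using the SDK's traced helper might look like the following; the searchDocs step and its stubbed retrieval are hypothetical.

typescript

import { traced } from "braintrust";

// Hypothetical retrieval step wrapped in its own span so it appears as a
// distinct unit of work in the trace.
async function searchDocs(query: string): Promise<string[]> {
  return traced(
    async (span) => {
      // Stand-in for a real vector-store or search call.
      const results = [`doc matching "${query}"`];
      span.log({ input: query, output: results });
      return results;
    },
    { name: "searchDocs" }
  );
}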

Online evals: Real-time monitoring

Online agent evaluation continuously monitors performance in production. Evaluate in your actual production environment for accurate insights. Let users provide feedback on agent responses. Implement continuous scoring for key behaviors like hallucinations, tool accuracy, and goal completion. Start by scoring all requests, then adjust sampling based on stability and traffic volume.
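
User feedback can be attached to the same traces. Here is a sketch, assuming the SDK's logFeedback method and that you kept the span id of the logged request; the score name and project name are illustrative.

typescript

import { initLogger } from "braintrust";

const logger = initLogger({ projectName: "customer-support" });

// Hypothetical handler: attach a user's thumbs-up/down to the logged request.
async function recordFeedback(spanId: string, thumbsUp: boolean) {
  logger.logFeedback({
    id: spanId, // the span id captured when the request was logged
    scores: { user_satisfaction: thumbsUp ? 1 : 0 },
    comment: thumbsUp ? undefined : "User flagged this response",
  });
}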

Low-scoring examples feed directly into offline evaluations, closing the loop from production back to systematic improvement.

Trace-driven debugging

Traces give you visibility into agent execution. Every intermediate step is visible and evaluable. When an agent fails, you can replay the exact sequence of decisions, tool calls, and outputs that led to the failure.

This enables debugging complex failures by replaying every intermediate action, iterating quickly without blind spots, and automating test expansion by tagging low-score traces into new datasets.

End-to-end evaluation assesses complete task flows for goal success, coherence, and robustness. Single-step evaluation isolates specific decisions to test tool selection, argument construction, or retrieval relevance in isolation. Both strategies are necessary.

Why Braintrust: Built for how developers actually work

Most AI tools force you to stitch together fragmented workflows. You build in one place, test in another, monitor in a third, and manually connect insights back to improvements. This friction slows down everyone.

Braintrust is different. It's purpose-built for the complete AI development loop with features no other platform offers.

Integrated evaluation and deployment

Every production trace becomes a test case. Production logs flow directly into datasets with one click. Spotted an edge case? Add it to your eval suite immediately.

Every test case informs production. Experiments aren't isolated from deployment. They're directly integrated. Changes that pass offline evaluation ship to production seamlessly.

Continuous feedback cycles compound improvements. Online scoring monitors live requests, feeds low-scoring examples back into offline evals, and creates systematic quality improvements over time.

No other platform integrates evaluation and production this tightly. Braintrust connects development, testing, and deployment in a single workflow.

Model-agnostic and CI/CD native

Braintrust works with any LLM provider: OpenAI, Anthropic, Cohere, open-source models, custom endpoints. Compare different models and providers side-by-side. Switch without rewriting infrastructure. Stay flexible as the AI landscape evolves.
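
One way this can look in code is routing requests through the Braintrust AI proxy with an OpenAI-compatible client; the endpoint URL and model identifier below are assumptions for illustration.

typescript

import OpenAI from "openai";

// OpenAI-compatible client pointed at the Braintrust AI proxy
// (endpoint URL assumed; requires a Braintrust API key).
const client = new OpenAI({
  baseURL: "https://api.braintrust.dev/v1/proxy",
  apiKey: process.env.BRAINTRUST_API_KEY,
});

// Swap providers by changing only the model name.
const completion = await client.chat.completions.create({
  model: "claude-3-5-sonnet-latest", // or "gpt-4o", an open-weights model, etc.
  messages: [{ role: "user", content: "What's the capital of France?" }],
});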

GitHub Actions integration brings production-grade CI/CD to AI development. Every pull request automatically runs evals and posts detailed results as comments showing exactly which test cases improved, which regressed, and by how much. Quality gates prevent regressions from reaching production.

Loop and one platform for teams

Loop analyzes your prompts, generates better-performing versions, creates evaluation datasets tailored to your use case, and builds scorers to measure the metrics that matter. It automates the time-intensive parts of eval development so you can focus on building compelling products. Loop isn't a separate tool—it's built directly into Braintrust, with full context on your logs, datasets, experiments, and prompts.

PMs and engineers don't need separate tools. Both work in the same Braintrust environment with the same visibility. Engineers write code-based tests. PMs prototype in the UI. Everyone reviews results, debugs issues, and tracks improvements together in real time. This eliminates handoff delays and translation overhead.

Know exactly what improved

Side-by-side experiment comparisons show score breakdowns, regression detection, and output diffs at the test case level. No vague aggregate metrics. You see precisely which inputs improved, which degraded, and why. Trace-level inspection reveals every intermediate step.

Ship as fast as the industry evolves

AI moves fast. New models drop weekly. Techniques evolve daily. The teams that succeed are those that can iterate faster than the landscape changes.

Braintrust gives you the velocity to keep pace. Go from idea to tested prototype in hours. Deploy with confidence knowing exactly what you're shipping. Catch regressions before users do. Build systematically instead of guessing.

Your first eval

Braintrust evals are straightforward. Here's an example:

typescript

import { Eval } from "braintrust";
import { Factuality } from "autoevals";

Eval("Customer Support Bot", {
  data: () => [
    {
      input: "How do I reset my password?",
      expected: "Visit Settings > Security > Reset Password",
    },
    {
      input: "What's your refund policy?",
      expected: "30-day money-back guarantee, no questions asked",
    },
  ],
  task: async (input) => {
    // Your AI logic here (generateResponse is your own application code)
    return await generateResponse(input);
  },
  scores: [Factuality],
});

Run it with braintrust eval and you'll see results in your terminal and the full experiment in the Braintrust UI.

Convert production traces to test cases today

Start logging production traces with the Braintrust SDK. Curate datasets from real user interactions by tagging interesting examples. Write scorers that define quality for your use case. Run evals and see exactly what's working and what's not. Iterate fast in playgrounds. Deploy with confidence knowing your changes improved quality.

Resources and next steps

Frequently asked questions

How do I create evals for my AI application?

Start with the three building blocks: define your data (test cases), task (your AI logic), and scorers (quality measurements). Use the Braintrust SDK to create an Eval function, then run braintrust eval to see results.

How do I start evaluating AI systems?

Begin by logging production traces with the Braintrust SDK. Curate a small dataset from real user interactions (10-200 examples). Define what "good" means for your use case by writing scorers. Run your first eval and iterate based on the results. Start small and build from real production data.

How do I measure agent performance?

Agent eval requires evaluating both end-to-end task completion and individual steps. Use offline evals to test specific agent actions in deterministic scenarios. Use online evals to monitor live agent behavior in production. Leverage traces to debug failures by replaying every intermediate decision, tool call, and output.

What's the difference between measuring AI quality and traditional software testing?

Measuring AI quality accounts for non-deterministic outputs and subjective quality criteria. Traditional unit tests check for exact matches, while AI evals use scorers to measure qualities like factuality, tone, and relevance. You need both code-based scorers for deterministic checks and LLM-based scorers for nuanced evaluation.