The Eval Manifesto

Before Braintrust, I led the AI team at Figma. And before that, I founded Impira, a startup focused on ML-powered document extraction. At both companies, the same pattern kept repeating: we'd change a model or tweak a prompt, something would silently break in production, and we'd discover it later through user pain, not through our tools. To fix this, we had to get really good at one thing: evals.

We built internal systems to continuously evaluate our models and prompts, understand regressions, and tie real user behavior back into how we measured quality. But once you start building evals, you immediately run into the next problem: where does the evaluation data come from? You need observability to know what's actually happening in production. You need evals to know whether your AI system is good. And you need a tight feedback loop between the two if you want to move fast without breaking your users. At both companies, we built this stack from scratch: observability, evals, feedback loops, better products.

Then, in a conversation with Elad Gil, who is now one of our investors, something clicked. He asked: "Why did you build the same internal tooling twice?" That question exposed a big gap in the ecosystem.

Every serious AI product team was quietly reinventing the same internal tools for observability and evals. None of it was off-the-shelf. None of it was productized in a way that teams could just pick up and use. So we started Braintrust.

A lot has changed since then. Models keep improving. Agents have exploded. Everyone can build an app now with just a few prompts. Some of this was predictable, much of it was not. But even I've been surprised at how fast and how significantly all of this has changed.

We've learned some things, both from building our own product and from working with customers. We've also developed some strong opinions about what to do, what not to do, and what it all means. I've benefited throughout my career from folks who pass on their knowledge and try to make the industry as a whole smarter. So, to return the favor, we're publishing everything we know about evals and AI observability.

These are our beliefs about evals. These are the terms and definitions you need to know to do them right. And these are the best practices for building evals yourself.

Ankur Goyal

Principles

Evals and observability are AI infrastructure

The eval is as important as the model, harness, agent, etc. It should be treated as part of the stack, not an afterthought.

If you’re building AI products, you need to consider the infrastructure stack the same way you would for any software. Which cloud provider, which database, which IaC framework? Which experiments, which scorers, which rubrics?

AI is now in production, supporting products used by millions. We are past the point of toy demos and neat experiments. Serious products need serious infrastructure. And real infrastructure is observable.

AI needs to be tested, not monitored

Evals are experiments that go beyond standard performance monitoring, and require a new set of products and approaches.

Every AI product, whether it be an agent, a chatbot, or an automated workflow, needs to be developed with success criteria in mind from inception, so it can be properly tested once in production.

AI products win or lose based not on uptime or latency, but on measures of quality that cannot be reduced to simple metrics.

Testing and experimentation are the new observability paradigms.
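
To make that concrete, here is a minimal sketch of what success criteria defined from inception can look like: a handful of cases, the task under test, and a scorer that captures a quality judgment rather than a latency number. The names (run_support_agent, mentions_expected_fact) are illustrative placeholders, not a particular SDK.

```python
def run_support_agent(question: str) -> str:
    # In the real product this would call your model or agent.
    return "You can reset your password from the account settings page."

def mentions_expected_fact(output: str, expected: str) -> float:
    # A simple rubric-style scorer: 1.0 if the expected phrase appears, else 0.0.
    # Real scorers are often LLM judges or task-specific rubrics.
    return 1.0 if expected.lower() in output.lower() else 0.0

cases = [
    {"input": "How do I reset my password?", "expected": "account settings"},
    {"input": "Can I export my data?", "expected": "export"},
]

scores = [
    mentions_expected_fact(run_support_agent(c["input"]), c["expected"])
    for c in cases
]
print(f"quality: {sum(scores) / len(scores):.2f} over {len(cases)} cases")
```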

AI observability is different than traditional observability

AI observability exists because AI products fail differently than traditional software, and improving them requires a fundamentally different set of tools.

AI observability is the infrastructure that lets you see what your AI system did, measure whether it was good, and systematically improve it. It connects three workflows that have historically lived in separate tools: tracing production behavior, evaluating output quality, and iterating on system configuration.

The reason these workflows need to be connected is that AI improvement is a closed loop. You observe production traces, identify quality issues, build evaluation datasets from real interactions, run experiments to test changes, and deploy improvements that you then observe again.
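
As a rough sketch of that loop, with fetch_flagged_traces, expected_fix, run_candidate, and judge standing in for your own logging store, review workflow, and application code rather than any specific API:

```python
def fetch_flagged_traces() -> list[dict]:
    # Observe: production traces that users rated poorly.
    return [{"input": "Summarize this contract", "output": "The contract is long."}]

def expected_fix(trace: dict) -> str:
    # Curate: a reviewer records what the system should have produced.
    return "a two-sentence summary of the key obligations"

def run_candidate(prompt: str) -> str:
    # Experiment: the change under test (new prompt, model, tools, retrieval...).
    return "Here is a two-sentence summary of the key obligations: pay on time, deliver on spec."

def judge(output: str, expected: str) -> float:
    # Placeholder scorer; in practice an LLM judge or task-specific rubric.
    return 1.0 if expected.lower() in output.lower() else 0.0

# Turn known failures into an eval dataset, then test the candidate against them.
dataset = [{"input": t["input"], "expected": expected_fix(t)}
           for t in fetch_flagged_traces()]
scores = [judge(run_candidate(c["input"]), c["expected"]) for c in dataset]
print(f"candidate fixes {sum(scores):.0f}/{len(scores)} known failures")
# Deploy the winner, keep observing, and repeat.
```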

Traditional observability answers "is the system operational?" AI observability answers "is the system producing good outputs, and how do we make them better?"

Traces are the atomic unit of AI observability

Traces are AI product behavior, and need to be at the center of any AI observability stack.

Traces capture inputs and outputs, prompt state, model versions, tool calls, retrieved context, and control flow decisions. This information goes beyond simple performance metrics like latency or error rate.

A trace reflects what your AI product was doing at a given time, and is useful precisely because the product may do something different the next time around.
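
One illustrative shape for a single trace span; the field names here are assumptions, not a specific tracing schema:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class TraceSpan:
    input: dict[str, Any]   # what the user or caller sent
    output: dict[str, Any]  # what the system returned
    prompt: str             # the rendered prompt that was actually sent
    model: str              # model name and version in use at the time
    tool_calls: list[dict] = field(default_factory=list)        # tools invoked and their results
    retrieved_context: list[str] = field(default_factory=list)  # chunks fed to the model
    metadata: dict[str, Any] = field(default_factory=dict)      # control-flow decisions, latency, user id, ...
    children: list["TraceSpan"] = field(default_factory=list)   # nested steps of an agent or workflow
```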

AI observability necessitates new database architecture

AI workloads are vastly different from traditional application workloads, and traces generate massive amounts of data.

Database entries are a mixture of text, prompts, tool calls, and model outputs. Queries are exploratory rather than confined to predefined dashboards.

AI is built with mutable data that requires a non-static approach. You need to annotate traces, re-score them, and re-run evals against historical context.
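
For example, a hypothetical sketch of re-scoring historical traces against a rubric written after the fact, with load_traces and stricter_judge as placeholders:

```python
def load_traces(since: str) -> list[dict]:
    # In practice: an exploratory query over your trace store, e.g. "every trace
    # from the last 30 days where the agent called the search tool".
    return [
        {"input": "q1", "output": "short answer"},
        {"input": "q2", "output": "a much longer, more careful answer"},
    ]

def stricter_judge(trace: dict) -> float:
    # A scorer written today, applied retroactively to old traffic
    # (a trivial length check standing in for a real rubric).
    return 1.0 if len(trace["output"]) > 20 else 0.0

for trace in load_traces(since="2024-01-01"):
    # Mutate history: attach new scores and annotations to existing traces.
    trace["scores"] = {"strict_quality": stricter_judge(trace)}
    trace.setdefault("annotations", []).append("re-scored with stricter rubric")
```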

Existing databases are too slow, too expensive, and don’t understand the structure of AI data.

Observability and product development are no longer separate

Evals are the new PRD, and should be considered at the start of any product development lifecycle.

If you don’t know what success means for an AI product, and can’t define the datasets for measuring this success, then you aren’t properly building for AI.

This doesn’t stop once the product is live and in production.

Production data feeds into evals, which in turn feed into new product capabilities.

Observe, eval, optimize. The feedback loop is continuous and ever changing.

Great AI products are built by cross-functional teams

Evals are how these teams collaborate and iterate.

Software engineers ship and debug AI features.

PMs define golden datasets, success criteria, and evals.

Data scientists and ML engineers plug into the same system to add rigor and depth.

Subject matter experts validate ground truth for product and engineering teams.

Get started with Evals

Braintrust is the AI observability and eval platform for production AI. By connecting evals and observability in one workflow, teams at Notion, Stripe, Zapier, Vercel, and Ramp ship quality AI products at scale.

Start building