Before Braintrust, I led the AI team at Figma. And before that, I founded Impira, a startup focused on ML-powered document extraction. At both companies, the same pattern kept repeating: we'd change a model or tweak a prompt, something would silently break in production, and we'd discover it later through user pain, not through our tools. To fix this, we had to get really good at one thing: evals.
We built internal systems to continuously evaluate our models and prompts, understand regressions, and tie real user behavior back into how we measured quality. But once you start building evals, you immediately run into the next problem: where does the evaluation data come from? You need observability to know what's actually happening in production. You need evals to know whether your AI system is good. And you need a tight feedback loop between the two if you want to move fast without breaking your users. At both companies, we built this stack from scratch: observability, evals, feedback loops, better products.
Then, in a conversation with Elad Gil, now one of our investors, something clicked. He asked: "Why did you build the same internal tooling twice?" That question exposed a big gap in the ecosystem.
Every serious AI product team was quietly reinventing the same internal tools for observability and evals. None of it was off-the-shelf. None of it was productized in a way that teams could just pick up and use. So we started Braintrust.
A lot has changed since then. Models keep improving. Agents have exploded. Everyone can build an app now with just a few prompts. Some of this was predictable; much of it was not. But even I've been surprised at how quickly and how dramatically all of this has changed.
We've learned some things, both in building our own product and from working with customers. We've also developed some strong opinions about what to do, what not to do, and what it all means. I've benefited throughout my career from folks who pass on their knowledge and try to make the industry as a whole smarter. So, to return the favor, we're publishing everything we know about evals and AI observability.
These are our beliefs about evals. These are the terms and definitions you need to know to do them right. And these are the best practices for building evals yourself.
Ankur Goyal