With Simon Last, Co-Founder
Notion, a connected workspace for documents, knowledge bases, and project management systems, is the go-to platform for millions of users, from agile startups to global enterprises. Notion’s commitment to customer experience has always set them apart — but with the rise of generative AI, they saw an opportunity to push their platform even further.

Co-founders Ivan Zhao and Simon Last were early in recognizing the potential of generative AI and started experimenting with large language models (LLMs) soon after GPT-2 launched in 2019. Given that context, it’s no surprise that Notion was one of the first teams to integrate GPT-4 into their product:
Today, those features (and much more!) make up Notion AI, which powers four essential capabilities: searching workspaces, generating and editing tailored documents, analyzing PDFs and images, and answering questions using information from both your workspace and the web. However, as the team rolled out these features, they realized their existing workflows for evaluating AI products weren’t up to the challenge.
In this case study, we’ll explore how the team at Notion evolved their evaluation workflow— dramatically improving their AI development speed and accuracy— by partnering with Braintrust.
One of Notion’s earliest AI innovations was its Q&A feature. With Q&A, users could ask complex questions, and AI would pull data directly from their Notion pages to provide insightful responses. It was a magical customer experience — despite the broad and unstructured input and output space, it almost always returned a helpful response.
Behind this magic was an immense engineering effort. Notion’s AI team did a ton of work to ensure Q&A could understand diverse user queries and generate a helpful output. However, the complexity of the product meant that their evaluation process— how they tested and improved the AI’s performance— needed an upgrade.
Initially, Notion’s evaluation workflow was manual and intensive:
JSONL files in a git repo, making them difficult to manage, version, and collaborate on.While this process helped Notion successfully launch Q&A in beta in late 2023, it was clear they needed a more scalable, efficient approach to take their AI products to the next level. That’s where Braintrust came in.
Partnering with Braintrust transformed Notion’s eval workflow, enabling faster iterations and ultimately higher quality features in production. Here’s an overview of the new workflow:

The first step is to decide on an improvement. It could be adding a new feature, fixing an issue based on user feedback, or enhancing existing capabilities. In the newest Notion AI product, users can ask questions and chat about anything — from the work context that lives in their Notion pages, to unrelated ideas and concepts across a wide variety of topics, powered by LLMs.
Instead of manually creating JSONL files, Notion now leverages Braintrust to curate, version, and manage datasets seamlessly. They typically start with 10-20 examples. To curate their datasets, Notion uses examples from real-world usage that are automatically logged in Braintrust, and writes examples by hand. To give you a sense of how important this step is, Notion has hundreds of eval datasets - and the number grows every week.
Notion’s approach is to clearly, often narrowly, define the scope of each dataset and scoring function. For more complex tasks, Notion will use a variety of well-defined evals, each testing for specific criteria. Typically, this is a mix of heuristic and LLM-as-a-judge scorers, along with a healthy dose of human review. Because Braintrust allows users to define their own custom scoring functions, Notion can flexibly test anything — tool usage, factual accuracy, hallucinations, recall, and more.
After making an update, Notion leverages Braintrust to immediately understand how performance changed overall and drill down into specific improvements and regressions. The team follows this rough process:
The team continues the cycle of make an update -> run evals and inspect results until they’re happy with the improvements and are ready to ship — on to the next!
The team at Notion is now able to triage and fix 30 issues per day, compared to just 3 per day using the old workflow— allowing them to deliver even better products, like their new suite of incredible AI features. By redefining their workflow, Notion has set a new standard for how AI product teams can evaluate and improve their generative AI products.
Generative AI is redefining what’s possible in software, and it’s been inspiring to watch Notion leverage AI to push the boundaries of excellent user experience. With Braintrust, Notion has created an evaluation workflow that empowers its AI team to ship faster and build with confidence.
Learn how Braintrust enables rapid iteration, clear visibility into regressions, and the confidence to ship world-class AI features faster.