Learn everything you need to know about evals by building and monitoring a customer support chatbot from scratch.
Four ways to analyze eval data: experiment comparison, Loop queries, the Braintrust MCP server, and manual filtering in the UI.
Running an eval gives you data. Analyzing that data is where you find insights. Braintrust provides several tools for this: experiment comparison, Loop queries, the MCP server, and manual filtering in the UI.
Experiment comparison is the core analysis tool. It answers the question: "Did this change actually help?"
Open the Experiments tab in your project. The table shows every eval run with aggregate metrics, including score averages, error counts, duration, token usage, and cost. To compare two experiments, select them and choose Compare. Braintrust shows side-by-side diffs so you can see exactly which test cases improved, regressed, or stayed the same.
The mechanics of running comparisons were covered in module 4. The key here is making comparison a habit. Every time you change a prompt, swap a model, or adjust a scorer, compare the new experiment against the previous one.
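The diff Braintrust shows in Compare boils down to classifying each shared test case by whether its score went up, down, or stayed the same. A minimal sketch of that logic, assuming experiments are represented as hypothetical dicts mapping a test-case key to its score (illustrative names, not the Braintrust schema):

```python
def diff_experiments(baseline, candidate):
    """Classify each test case present in both runs as improved, regressed, or unchanged."""
    diff = {"improved": [], "regressed": [], "unchanged": []}
    for key, base_score in baseline.items():
        if key not in candidate:
            continue  # test case was dropped between runs; skip it
        new_score = candidate[key]
        if new_score > base_score:
            diff["improved"].append(key)
        elif new_score < base_score:
            diff["regressed"].append(key)
        else:
            diff["unchanged"].append(key)
    return diff


# Two hypothetical runs over the same three test cases
baseline = {"q1": 0.6, "q2": 0.9, "q3": 0.5}
candidate = {"q1": 0.8, "q2": 0.9, "q3": 0.4}
print(diff_experiments(baseline, candidate))
# {'improved': ['q1'], 'regressed': ['q3'], 'unchanged': ['q2']}
```

A higher average score can hide regressions on individual cases, which is why the per-case view matters: here the aggregate barely moved, but `q3` got worse.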
Loop is Braintrust's natural language interface for querying your data. Instead of writing SQL or manually filtering rows, you ask questions in plain English:
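For example, you might ask questions like these (hypothetical phrasings; Loop accepts any natural-language question about your data):

```
Which test cases regressed between my last two experiments?
What is the average score for inputs that mention refunds?
Show me the five lowest-scoring outputs in module_7_concise_persona.
```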
Loop translates your question into a query, runs it, and returns a summary. This is useful for quick exploratory analysis without leaving the Braintrust UI.
The Braintrust MCP server lets you query your experiment data from any MCP-compatible tool, such as Claude, Cursor, or your own CLI. It brings your eval data into your development environment so you can ask questions while you code.
Enable it under Settings > MCP Servers. The server exposes tools like search_docs, sql_query, summarize_experiment, and list_recent_changes. You can also run SQL queries directly and let the server infer the schema for you.
From an MCP client, you can run queries like:
Using the Braintrust MCP server, show me the inputs where
module_7_concise_persona scored below 50% on brand alignment.
This makes experiment data accessible anywhere you work, not just in the Braintrust UI.
The experiment rows view lets you inspect individual test cases directly. You can filter rows by field values, sort by any column, and click into a row to see its full input, output, and scores.
For example, filtering for inputs that contain "refund" shows only the test cases about refunds, so you can evaluate how your system handles that specific topic. Sorting by score (ascending) surfaces your worst-performing cases first, which is where you'll get the most improvement per fix.
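That filter-then-sort workflow is simple enough to sketch in a few lines. Here the rows are hypothetical dicts with `input` and `score` fields (illustrative names, not the actual Braintrust row schema):

```python
# Hypothetical experiment rows: each dict stands in for one test case.
rows = [
    {"input": "How do I request a refund?", "score": 0.42},
    {"input": "What are your support hours?", "score": 0.95},
    {"input": "My refund hasn't arrived yet.", "score": 0.31},
]

# Filter: keep only the refund-related test cases.
refund_rows = [r for r in rows if "refund" in r["input"].lower()]

# Sort ascending by score so the worst-performing cases come first.
worst_first = sorted(refund_rows, key=lambda r: r["score"])

for r in worst_first:
    print(f'{r["score"]:.2f}  {r["input"]}')
# 0.31  My refund hasn't arrived yet.
# 0.42  How do I request a refund?
```

The same two operations in the UI (a text filter plus an ascending sort on score) put your weakest refund cases at the top of the table.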
In the next module, you'll build a multi-turn chat application with production logging, using init_logger, wrap_openai, and @traced.