How to analyze your eval results

Four ways to analyze eval data: experiment comparison, Loop queries, the Braintrust MCP server, and manual filtering in the UI.

Beyond running evals

Running an eval gives you data. Analyzing that data is where you find insights. Braintrust provides several tools for this: experiment comparison, Loop queries, the MCP server, and manual filtering in the UI.

Comparing experiments

Experiment comparison is the core analysis tool. It answers the question: "Did this change actually help?"

Open the Experiments tab in your project. The table shows every eval run with aggregate metrics, including score averages, error counts, duration, token usage, and cost. To compare two experiments, select them and choose Compare. Braintrust shows side-by-side diffs so you can see exactly which test cases improved, regressed, or stayed the same.

The mechanics of running comparisons were covered in module 4. The key here is making comparison a habit. Every time you change a prompt, swap a model, or adjust a scorer, compare the new experiment against the previous one.
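
The diff logic the UI performs can be sketched in plain Python over exported results. The record shape here (`input` and `score` fields) is an assumption for illustration, not the official Braintrust export schema:

```python
# Sketch: diff two experiments' per-case scores, like the Compare view.
# The record shape ({"input": ..., "score": ...}) is an assumed export
# format, not the official Braintrust schema.

def diff_experiments(baseline, candidate):
    """Return which inputs improved, regressed, or stayed the same."""
    base = {r["input"]: r["score"] for r in baseline}
    cand = {r["input"]: r["score"] for r in candidate}
    improved, regressed, unchanged = [], [], []
    # Only compare cases present in both experiments.
    for key in sorted(base.keys() & cand.keys()):
        delta = round(cand[key] - base[key], 6)
        if delta > 0:
            improved.append((key, delta))
        elif delta < 0:
            regressed.append((key, delta))
        else:
            unchanged.append(key)
    return improved, regressed, unchanged

baseline = [{"input": "q1", "score": 0.8}, {"input": "q2", "score": 0.5}]
candidate = [{"input": "q1", "score": 0.9}, {"input": "q2", "score": 0.5}]
improved, regressed, unchanged = diff_experiments(baseline, candidate)
print(improved)   # [('q1', 0.1)]
print(unchanged)  # ['q2']
```

Keying the diff on the input ensures you compare the same test case across runs, which is exactly why comparing against the previous experiment after every change is worth making a habit.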

Loop

Loop is Braintrust's natural language interface for querying your data. Instead of writing SQL or manually filtering rows, you ask questions in plain English:

  • "Which inputs consistently scored below 100%?"
  • "What's the average brand alignment score across all experiments?"
  • "Show me the test cases where the output mentions refunds."

Loop translates your question into a query, runs it, and returns a summary. This is useful for quick exploratory analysis without leaving the Braintrust UI.
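
To make the second example question concrete, here is what an equivalent computation looks like by hand over exported rows. The row shape (`experiment` and `scores` fields) is an assumption for illustration; it is not the query Loop actually generates:

```python
from collections import defaultdict

# Sketch: average a named score per experiment over exported rows.
# The row shape is an assumed export format, not Loop's actual output.
rows = [
    {"experiment": "exp_a", "scores": {"brand_alignment": 0.75}},
    {"experiment": "exp_a", "scores": {"brand_alignment": 0.25}},
    {"experiment": "exp_b", "scores": {"brand_alignment": 0.625}},
]

per_experiment = defaultdict(list)
for row in rows:
    per_experiment[row["experiment"]].append(row["scores"]["brand_alignment"])

averages = {exp: sum(vals) / len(vals) for exp, vals in per_experiment.items()}
print(averages)  # {'exp_a': 0.5, 'exp_b': 0.625}
```

Loop's value is that it saves you from writing and maintaining this kind of glue code for one-off questions.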

The Braintrust MCP server

The Braintrust MCP server lets you query your experiment data from any MCP-compatible tool, such as Claude, Cursor, or your own CLI. It brings your eval data into your development environment so you can ask questions while you code.

Enable it under Settings > MCP Servers. The server exposes tools like search_docs, sql_query, summarize_experiment, and list_recent_changes. You can also run SQL queries directly and let the server infer the schema for you.

From an MCP client, you can run queries like:

Using the Braintrust MCP server, show me the inputs where
module_7_concise_persona scored below 50% on brand alignment.

This makes experiment data accessible anywhere you work, not just in the Braintrust UI.

Manual exploration and filtering

The experiment rows view lets you inspect individual test cases directly. You can:

  • Filter by score range to find all cases below a threshold
  • Search for specific keywords in the input or output
  • Sort by duration, score, or token count
  • Group results by metadata fields

For example, filtering for inputs that contain "refund" shows only the test cases about refunds, so you can evaluate how your system handles that specific topic. Sorting by score (ascending) surfaces your worst-performing cases first, which is where you'll get the most improvement per fix.
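
The same filter-and-sort pass can be sketched over exported rows. The row shape (`input`, `output`, `score`) is an assumption for illustration, not the official export schema:

```python
# Sketch: replicate the UI's keyword filter and ascending score sort.
# The row shape ("input", "output", "score") is an assumed export
# format for illustration, not the official Braintrust schema.
rows = [
    {"input": "How do refunds work?", "output": "...", "score": 0.4},
    {"input": "Reset my password", "output": "...", "score": 0.9},
    {"input": "Refund for order 123", "output": "...", "score": 0.2},
]

# Filter: keep only refund-related cases (case-insensitive match).
refund_cases = [r for r in rows if "refund" in r["input"].lower()]

# Sort ascending by score so the worst-performing cases surface first.
worst_first = sorted(refund_cases, key=lambda r: r["score"])
for r in worst_first:
    print(r["score"], r["input"])
```

Sorting worst-first mirrors the triage order suggested above: the lowest-scoring cases are where a fix pays off most.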

What's next

In the next module, you'll build a multi-turn chat application with production logging, using init_logger, wrap_openai, and @traced.

Further reading

Trace everything