The eval improvement loop

Close the loop: find problems in production, sample them into a dataset, run a baseline eval, test a fix, and verify the results. This is the full eval-driven improvement cycle.

All the assets for this module are available at braintrustdata/eval-101-course/module-14.

Connecting everything

This module ties together everything from the course. You built the chatbot in module 10, logged conversations and scored them with online scoring in modules 11 and 12. In modules 3 through 9, you created datasets and ran evals. Two connections still need to be made:

  1. Logs to dataset. How do you turn production findings into test cases?
  2. Evals back to app. How do you use eval results to improve your application?

The eval improvement loop is the workflow that connects these pieces into a repeatable cycle.

Step 1: Find a problem in production

Start with your production logs. Use the topic maps from the previous module to identify which categories are underperforming. Combine topics with online scoring data to find the intersection of a specific topic and low scores.

For example, you might notice that conversations tagged "Account access and billing issues" have significantly lower brand alignment scores than other categories. That is your signal.
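In code, this intersection of topic and score is just a filter. The sketch below runs over in-memory log records; the field names (`topic`, `scores`) and the 0.6 threshold are illustrative assumptions, not part of any Braintrust API.

```python
# Hypothetical log records, as they might look after export from production.
logs = [
    {"input": "I can't log in", "topic": "Account access and billing issues",
     "scores": {"brand_alignment": 0.35}},
    {"input": "What plans do you offer?", "topic": "Product questions",
     "scores": {"brand_alignment": 0.9}},
    {"input": "Why was I charged twice?", "topic": "Account access and billing issues",
     "scores": {"brand_alignment": 0.8}},
]

def low_scoring(records, topic, score_name, threshold=0.6):
    """Return records in `topic` whose score falls below `threshold`."""
    return [
        r for r in records
        if r["topic"] == topic and r["scores"][score_name] < threshold
    ]

failing = low_scoring(logs, "Account access and billing issues", "brand_alignment")
# Only the locked-out login conversation survives both filters.
```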

Step 2: Create a dataset from production logs

Filter your logs to the problem topic, then filter further by low scores. Select a representative set of failing cases and create a dataset directly from the filtered logs.

This dataset should include:

  • Clear failure cases (low scores)
  • Borderline cases (scores near the threshold)
  • A few passing cases (to make sure your fix does not break what already works)
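The three buckets above can be selected programmatically before you push the records into a dataset. This is a minimal sketch under assumed field names and thresholds; a 0.6 pass threshold with a 0.1 borderline margin is an arbitrary choice for illustration.

```python
def build_eval_dataset(records, score_name, threshold=0.6, margin=0.1):
    """Split scored records into failing, borderline, and passing buckets,
    then combine them: all failures, all borderline cases, and a few
    passing cases as regression guards."""
    failing, borderline, passing = [], [], []
    for r in records:
        s = r["scores"][score_name]
        if s < threshold - margin:
            failing.append(r)
        elif s < threshold + margin:
            borderline.append(r)
        else:
            passing.append(r)
    return failing + borderline + passing[:3]

records = [
    {"id": 1, "scores": {"brand_alignment": 0.3}},   # clear failure
    {"id": 2, "scores": {"brand_alignment": 0.55}},  # borderline
    {"id": 3, "scores": {"brand_alignment": 0.65}},  # borderline
    {"id": 4, "scores": {"brand_alignment": 0.9}},   # passing
    {"id": 5, "scores": {"brand_alignment": 0.92}},  # passing
    {"id": 6, "scores": {"brand_alignment": 0.95}},  # passing
    {"id": 7, "scores": {"brand_alignment": 0.99}},  # passing (trimmed off)
]
dataset = build_eval_dataset(records, "brand_alignment")
```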

Step 3: Run a baseline eval

Before changing anything, run your current system against the new dataset:

```python
from braintrust import Eval

Eval(
    "Customer Support Chatbot",
    data=lambda: account_issues_dataset,
    task=current_task,
    scores=[brand_alignment_scorer, conversation_quality_scorer],
    experiment_name="baseline_account_issues",
)
```

Record the scores. This is your "before" measurement.

Step 4: Test a hypothesis

Now make your change. For example, if the bot is struggling with account and billing questions, you might update the system prompt to include specific policy information:

```python
updated_system_prompt = """You are an efficient customer support agent.
When handling account access issues, follow these steps:
1. Verify the customer's identity.
2. Check for any security holds on the account.
3. Walk through the account recovery process.

For billing questions, reference the current pricing at
support.example.com/pricing."""
```

Change one thing at a time. If you change the prompt and the model simultaneously, you will not know which change helped or hurt.
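One way to keep the change isolated is to build the task from the prompt, so the prompt is the only variable between experiments. This sketch assumes a task that takes a list of chat messages; `make_task` and the short prompt are illustrative, and the model call is stubbed out.

```python
# Short stand-in for the longer system prompt shown above.
updated_system_prompt = (
    "You are an efficient customer support agent. "
    "Verify identity before walking through account recovery."
)

def make_task(system_prompt):
    """Build a task that prepends `system_prompt` to each conversation.

    The model call itself is stubbed: a real task would send `messages`
    to your chat model and return the assistant's reply.
    """
    def task(conversation):
        messages = [{"role": "system", "content": system_prompt}]
        messages.extend(conversation)
        return messages  # stand-in for the model response
    return task

updated_task = make_task(updated_system_prompt)
```

Because only the prompt string changes between `current_task` and `updated_task`, any score movement can be attributed to the prompt.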

Step 5: Run the eval and compare

Run the eval again with your fix:

```python
from braintrust import Eval

Eval(
    "Customer Support Chatbot",
    data=lambda: account_issues_dataset,
    task=updated_task,
    scores=[brand_alignment_scorer, conversation_quality_scorer],
    experiment_name="fix_account_issues_v1",
)
```

Compare the new experiment against the baseline. Check whether the failing cases improved, whether the passing cases stayed the same, and whether the borderline cases moved in the right direction.
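The Braintrust UI does this comparison for you, but the logic is simple enough to sketch locally. Here, per-case scores from each experiment are assumed to be available as dicts keyed by case id; the function name and data shape are illustrative.

```python
def compare_experiments(baseline, fix):
    """Bucket per-case scores into improved, regressed, and unchanged.

    Both arguments map a case id to its score in that experiment.
    """
    improved, regressed, unchanged = [], [], []
    for case_id, before in baseline.items():
        after = fix[case_id]
        if after > before:
            improved.append(case_id)
        elif after < before:
            regressed.append(case_id)
        else:
            unchanged.append(case_id)
    return {"improved": improved, "regressed": regressed, "unchanged": unchanged}

report = compare_experiments(
    {"case-1": 0.3, "case-2": 0.55, "case-3": 0.9},   # baseline scores
    {"case-1": 0.8, "case-2": 0.6, "case-3": 0.9},    # scores after the fix
)
```

A healthy result is a populated `improved` list and an empty `regressed` list: the failing cases moved up and the passing cases held steady.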

When the fix does not work

Sometimes the fix does not help, and understanding why is just as valuable as a successful fix.

For example, you might add detailed policy information to the system prompt and find that scores do not improve. When you dig into the traces, you discover that the scorer itself is not calibrated for account-related conversations. The brand alignment rubric was designed for general support interactions and does not account for the specific language patterns in account recovery flows.

This reveals two paths forward:

  1. Calibrate the scorer. Update the scoring rubric to handle account-related conversations correctly, then re-run the eval.
  2. Refine the prompt further. Try a different approach to the prompt change, informed by what the traces revealed.

Both are valid next steps, and both feed back into the loop. The important thing is that the eval data tells you where to look, instead of leaving you to guess.
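For the scorer-calibration path, one pattern is to keep topic-specific rubrics alongside a general fallback, so account-recovery language stops being penalized. Everything here is a made-up sketch: the rubric wording, the topic key, and `select_rubric` are illustrative, and the rubric would ultimately be fed to your LLM judge.

```python
# Illustrative rubrics; the wording and topic names are invented for this sketch.
RUBRICS = {
    "default": (
        "Score 1 if the reply is friendly, concise, and on-brand; 0 otherwise."
    ),
    "Account access and billing issues": (
        "Score 1 if the reply is on-brand AND uses appropriate security-process "
        "language (identity verification, account holds) where relevant."
    ),
}

def select_rubric(topic):
    """Pick a topic-specific rubric, falling back to the general one."""
    return RUBRICS.get(topic, RUBRICS["default"])
```

After recalibrating, re-run the same baseline and fix experiments so the "before" and "after" numbers are measured by the same rubric.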

The loop is continuous

The improvement loop is not a one-time process. After deploying a fix:

  • Online scoring monitors whether the fix holds in production.
  • Topics show whether the problem category's scores improved.
  • New failure cases get sampled into datasets for the next iteration.

Each cycle makes your system better, and the data accumulates over time. Your datasets grow from real production failures, your scorers get refined as you discover edge cases, and your prompts evolve based on evidence rather than intuition.

That's a wrap

And that's the end of the foundations course. You started with the question of why evals matter, built your first eval in the UI and then in code, compared experiments, learned to read traces, shipped a multi-turn chatbot with production logging, set up online scoring, clustered logs into topics, and closed the loop by turning real failures into better datasets and better prompts. You now have the full workflow for building, monitoring, and improving an AI system with evals.

Come share what you're building and any feedback on the course on Discord. And if you're interested in using Braintrust at your company, reach out.

Further reading

Trace everything