Learn everything you need to know about evals by building and monitoring a customer support chatbot from scratch.
Close the loop: find problems in production, sample them into a dataset, run a baseline, test a fix, and verify the results. The full eval-driven improvement cycle.
All the assets for this module are available at braintrustdata/eval-101-course/module-14.
This module ties together everything from the course. You built the chatbot in module 10, logged conversations and scored them with online scoring in modules 11 and 12. In modules 3 through 9, you created datasets and ran evals. Two connections still need to be made: turning production failures into eval datasets, and feeding eval results back into production improvements.
The eval improvement loop is the workflow that connects these pieces into a repeatable cycle.
Start with your production logs. Use the topic maps from the previous module to identify which categories are underperforming. Combine topics with online scoring data to find the intersection of a specific topic and low scores.
For example, you might notice that conversations tagged "Account access and billing issues" have significantly lower brand alignment scores than other categories. That is your signal.
Filter your logs to the problem topic, then filter further by low scores. Select a representative set of failing cases and create a dataset directly from the filtered logs.
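In the Braintrust UI this is a matter of applying filters and clicking "Add to dataset," but the selection logic is worth making concrete. Here is a minimal sketch, assuming your exported logs are dicts with `topic` and `scores` fields (these field names are illustrative, not the actual export schema):

```python
# Sketch: select failing production logs for a problem topic.
# The log structure here is hypothetical; adapt it to however
# you export or query your logs.

def select_failing_cases(logs, topic, score_name, threshold=0.5):
    """Return logs matching `topic` whose score is below `threshold`."""
    return [
        log
        for log in logs
        if log["topic"] == topic
        and log["scores"].get(score_name, 1.0) < threshold
    ]

logs = [
    {"topic": "Account access and billing issues",
     "scores": {"brand_alignment": 0.3},
     "input": "I can't log in to my account."},
    {"topic": "Shipping",
     "scores": {"brand_alignment": 0.9},
     "input": "Where is my order?"},
]

failing = select_failing_cases(
    logs, "Account access and billing issues", "brand_alignment"
)
# Only the low-scoring account-access conversation survives the filter.
```

The threshold is a judgment call: set it low enough that you capture genuine failures, but include a few borderline cases too, since those tell you whether a fix moves scores in the right direction.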
This dataset should include the full conversation input for each failing case, along with metadata such as the topic tag and the scores it received, so every row can be traced back to its source log.
Before changing anything, run your current system against the new dataset:
```python
from braintrust import Eval

Eval(
    "Customer Support Chatbot",
    data=lambda: account_issues_dataset,
    task=current_task,
    scores=[brand_alignment_scorer, conversation_quality_scorer],
    experiment_name="baseline_account_issues",
)
```
Record the scores. This is your "before" measurement.
Now make your change. For example, if the bot is struggling with account and billing questions, you might update the system prompt to include specific policy information:
```python
updated_system_prompt = """You are an efficient customer support agent.
When handling account access issues, follow these steps:
1. Verify the customer's identity.
2. Check for any security holds on the account.
3. Walk through the account recovery process.
For billing questions, reference the current pricing at
support.example.com/pricing."""
```
Change one thing at a time. If you change the prompt and the model simultaneously, you will not know which change helped or hurt.
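One way to enforce that discipline in code is to build both tasks from a shared helper, so the prompt is the only variable. This is a sketch, not the course's exact code; `make_task` and the injected `call_model` are hypothetical names:

```python
# Sketch: isolate the prompt change so the comparison is one variable.
# The model call is injected, so baseline and fix share everything
# except the system prompt.

def make_task(system_prompt, call_model):
    """Build a task that differs from the baseline only in its prompt."""
    def task(user_input):
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_input},
        ]
        return call_model(messages)  # same model, same parameters
    return task

# Demo with a stub model that just echoes the system prompt it was given:
demo_task = make_task(
    "You are an efficient customer support agent.",
    lambda messages: messages[0]["content"],
)
```

With this pattern, `current_task` and `updated_task` come from the same factory, which makes it hard to accidentally change the model or its parameters at the same time as the prompt.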
Run the eval again with your fix:
```python
from braintrust import Eval

Eval(
    "Customer Support Chatbot",
    data=lambda: account_issues_dataset,
    task=updated_task,
    scores=[brand_alignment_scorer, conversation_quality_scorer],
    experiment_name="fix_account_issues_v1",
)
```
Compare the new experiment against the baseline. Check whether the failing cases improved, whether the passing cases stayed the same, and whether the borderline cases moved in the right direction.
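Braintrust's experiment comparison view does this for you, but the underlying check is simple to sketch. Assuming you have per-case scores from each experiment keyed by case id (an assumption about how you export results, not a real API), the diff looks like:

```python
# Sketch: bucket each case as improved, regressed, or unchanged
# between the baseline experiment and the fix.

def diff_scores(baseline, fixed):
    """Compare per-case scores from two experiments."""
    buckets = {"improved": [], "regressed": [], "unchanged": []}
    for case_id, before in baseline.items():
        after = fixed[case_id]
        if after > before:
            buckets["improved"].append(case_id)
        elif after < before:
            buckets["regressed"].append(case_id)
        else:
            buckets["unchanged"].append(case_id)
    return buckets

baseline = {"case_1": 0.3, "case_2": 0.9, "case_3": 0.5}
fixed = {"case_1": 0.8, "case_2": 0.9, "case_3": 0.4}
result = diff_scores(baseline, fixed)
# case_1 improved, case_2 held steady, case_3 regressed
```

The "regressed" bucket is the one to watch: a fix that raises the average while breaking previously passing cases is not a clean win.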
Sometimes the fix does not help, and understanding why is just as valuable as a successful fix.
For example, you might add detailed policy information to the system prompt and find that scores do not improve. When you dig into the traces, you discover that the scorer itself is not calibrated for account-related conversations. The brand alignment rubric was designed for general support interactions and does not account for the specific language patterns in account recovery flows.
This reveals two paths forward: refine the scorer's rubric so it evaluates account-related conversations fairly, or treat the rubric as the standard and adjust the bot's language to meet it.
Both are valid next steps, and both feed back into the loop. The important thing is that the eval data tells you where to look instead of leaving you to guess.
The improvement loop is not a one-time process. After deploying a fix, keep monitoring: watch the online scores for the affected topic, check whether new failure patterns emerge, and sample fresh production failures into your datasets.
Each cycle makes your system better, and the data accumulates over time. Your datasets grow from real production failures, your scorers get refined as you discover edge cases, and your prompts evolve based on evidence rather than intuition.
And that's the end of the foundations course. You started with the question of why evals matter, built your first eval in the UI and then in code, compared experiments, learned to read traces, shipped a multi-turn chatbot with production logging, set up online scoring, clustered logs into topics, and closed the loop by turning real failures into better datasets and better prompts. You now have the full workflow for building, monitoring, and improving an AI system with evals.
Come share what you're building and any feedback on the course on Discord. And if you're interested in using Braintrust at your company, reach out.