Learn everything you need to know about evals by building and monitoring a customer support chatbot from scratch.
Per-turn scorers catch individual response issues. Trace-level scorers catch conversation-wide failures like lost context or unresolved issues. Run both together.
All the assets for this module are available at braintrustdata/eval-101-course/module-11.
The brand alignment scorer from earlier modules evaluates a single turn. Multi-turn conversations often have failures that single-turn scoring cannot see.
Consider this example: a customer provides their order number on turn 1, and the AI bot asks for it again on turn 4. Each individual response could score 100% on brand alignment, but the conversation as a whole fails. The bot lost context, and the customer has to repeat themselves.
This is why you need trace-level scoring to evaluate a full conversation as one unit.
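To make the lost-context failure concrete, here is a minimal sketch of a trace-level check that scans a whole transcript for the bot re-asking for information the customer already provided. The `repeats_request` function and its keyword-matching heuristic are illustrative, not part of Braintrust:

```python
def repeats_request(messages, keyword):
    """Return True if the assistant asks for `keyword` after the user already gave it."""
    provided_at = None
    for i, msg in enumerate(messages):
        text = msg["content"].lower()
        if msg["role"] == "user" and keyword in text:
            provided_at = i  # customer supplied the info at this turn
        elif msg["role"] == "assistant" and keyword in text and "?" in text:
            # the bot is asking for info that was already provided earlier
            if provided_at is not None and i > provided_at:
                return True
    return False

conversation = [
    {"role": "user", "content": "My order number is 12345."},
    {"role": "assistant", "content": "Thanks! Let me look into that."},
    {"role": "user", "content": "It still hasn't shipped."},
    {"role": "assistant", "content": "Could you share your order number?"},
]
```

Scored turn by turn, every reply above looks fine; only a check over the full message list catches the repeated request.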
Per-turn scorers evaluate each individual response. They answer the question: "Was this specific reply good?"
Trace-level scorers evaluate the entire conversation. They answer: "Did this conversation accomplish its goal?"
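The two kinds of scorers differ mainly in what they receive as input: a single reply versus the full message history. A sketch in Python (the function names, the placeholder checks, and the 0-to-1 return convention are illustrative, not prescribed by Braintrust):

```python
def per_turn_scorer(input, output):
    """Per-turn: was this specific reply good?"""
    # placeholder check: a real scorer would assess tone, accuracy, policy
    return 1.0 if output.strip() else 0.0

def trace_level_scorer(messages):
    """Trace-level: did the conversation accomplish its goal?"""
    # placeholder check: a real scorer would judge the whole transcript
    last = messages[-1]["content"].lower()
    return 1.0 if "resolved" in last else 0.0
```
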
You typically need both: per-turn scorers to catch bad individual responses, and trace-level scorers to catch conversation-wide failures like lost context or unresolved issues.
You can build a trace-level scorer that evaluates the full conversation. This scorer looks at the entire message history and assesses whether the agent understood the issue, maintained context, and resolved the problem.
In the Braintrust UI, create a new scorer under your project's functions. Set it to evaluate the full conversation output rather than individual turns. The scoring prompt should cover criteria like whether the agent understood the issue, maintained context across turns, and resolved the customer's problem.
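As a sketch of what that scoring prompt might look like, here is one way to render a transcript into an LLM-judge prompt. The prompt wording and the `build_judge_prompt` helper are illustrative assumptions, not a Braintrust-provided template:

```python
# Illustrative judge prompt covering the three conversation-level criteria
CONVERSATION_QUALITY_PROMPT = """You are evaluating a full customer support conversation.

Conversation:
{conversation}

Rate the conversation on each criterion:
1. Did the agent understand the customer's issue?
2. Did the agent maintain context (never re-ask for information already given)?
3. Was the customer's problem resolved by the end?

Answer with a score from 0 to 1, where 1 means all criteria are met."""

def build_judge_prompt(messages):
    """Flatten the message history into the judge prompt."""
    transcript = "\n".join(f'{m["role"]}: {m["content"]}' for m in messages)
    return CONVERSATION_QUALITY_PROMPT.format(conversation=transcript)
```
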
To see where per-turn and trace-level scores disagree, write a script that fetches your production logs and runs both scorers:
import braintrust

client = braintrust.init_logger(project="Customer Support Chatbot")

# Fetch recent logs
logs = client.fetch_logs()

for log in logs:
    # Per-turn scores are already attached to individual spans
    # Trace-level score evaluates the full conversation
    conversation_score = conversation_quality_scorer(log)
    client.log(scores={"conversation_quality": conversation_score}, id=log["id"])
The interesting cases are where the scores disagree. A conversation where every turn scores well on brand alignment but scores poorly on conversation quality indicates a structural problem, such as the bot going in circles or failing to resolve the issue.
When you run both scorers across your logs, look for these patterns: high per-turn scores with a low trace-level score (polished replies, broken conversation), low per-turn scores with a high trace-level score (rough replies that still got the job done), and both scores low (a clear failure).
The disagreement cases are the most valuable because they reveal blind spots in your scoring strategy.
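A small helper can bucket each conversation by how the two scores relate, making the disagreement cases easy to filter for. The 0.5 threshold and the bucket labels are illustrative choices:

```python
def disagreement_bucket(per_turn_avg, trace_score, threshold=0.5):
    """Classify a conversation by agreement between per-turn and trace-level scores."""
    turn_ok = per_turn_avg >= threshold
    trace_ok = trace_score >= threshold
    if turn_ok and not trace_ok:
        # polished replies, broken conversation: a structural problem
        return "good turns, failed conversation"
    if not turn_ok and trace_ok:
        return "rough turns, resolved conversation"
    if turn_ok and trace_ok:
        return "healthy"
    return "failing"
```

Sorting your logs into these buckets turns the disagreement cases from anecdotes into a queue you can review systematically.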
In the next module, you'll set up online scoring so that both per-turn and trace-level scorers run automatically on every new conversation, without manual scripts.