Analyzing multi-turn traces

Per-turn scorers catch individual response issues. Trace-level scorers catch conversation-wide failures like lost context or unresolved issues. Run both together.

All the assets for this module are available at braintrustdata/eval-101-course/module-11.

Why per-turn scoring is not enough

The brand alignment scorer from earlier modules evaluates a single turn. Multi-turn conversations often fail in ways that single-turn scoring cannot see.

Consider this example: a customer provides their order number on turn 1, and the AI bot asks for it again on turn 4. Each individual response could score 100% on brand alignment, but the conversation as a whole fails. The bot lost context, and the customer has to repeat themselves.

This is why you need trace-level scoring to evaluate a full conversation as one unit.
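The lost-context failure above is easy to see when the trace is treated as one unit. As a minimal sketch (the message format and the `lost_context` helper are illustrative, not part of the Braintrust SDK), a naive trace-level check can flag what per-turn scoring misses:

```python
# A multi-turn trace represented as one unit: the full message history.
conversation = [
    {"role": "user", "content": "My order #4821 never arrived."},
    {"role": "assistant", "content": "Sorry to hear that! Let me look into it."},
    {"role": "user", "content": "Thanks, it's been two weeks."},
    {"role": "assistant", "content": "Could you share your order number?"},  # context lost
]

def lost_context(messages):
    """Naive check: did the assistant ask for an order number
    the customer already provided earlier in the conversation?"""
    provided = any("#" in m["content"] for m in messages if m["role"] == "user")
    asked_again = any(
        "order number" in m["content"].lower()
        for m in messages
        if m["role"] == "assistant"
    )
    return provided and asked_again

print(lost_context(conversation))  # True, even though each turn reads fine alone
```

A real scorer would use an LLM judge rather than string matching, but the shape is the same: it only works because the scorer sees every turn at once.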

Per-turn vs. trace-level scorers

Per-turn scorers evaluate each individual response. They answer the question: "Was this specific reply good?"

Trace-level scorers evaluate the entire conversation. They answer: "Did this conversation accomplish its goal?"

You typically need both:

  • Per-turn scoring catches issues with tone, helpfulness, and factual accuracy on individual responses.
  • Trace-level scoring catches issues with context retention, resolution rate, and overall conversation quality.
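The key difference is the input each scorer sees. As a sketch with toy heuristics standing in for real LLM-judge scorers (both function bodies are illustrative assumptions):

```python
def per_turn_scorer(response: str) -> float:
    """Per-turn: sees one reply in isolation.
    Toy heuristic: reward complete sentences."""
    return 1.0 if response.strip().endswith((".", "!", "?")) else 0.5

def trace_level_scorer(messages: list[dict]) -> float:
    """Trace-level: sees the whole history, so it can penalize
    conversation-wide failures like an unresolved issue."""
    resolved = any(
        "resolved" in m["content"].lower()
        for m in messages
        if m["role"] == "assistant"
    )
    return 1.0 if resolved else 0.0
```

No per-turn scorer, however good, can compute something like `resolved`, because resolution is a property of the conversation, not of any single reply.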

Building a conversation quality scorer

You can build a trace-level scorer that evaluates the full conversation. This scorer looks at the entire message history and assesses whether the agent understood the issue, maintained context, and resolved the problem.

In the Braintrust UI, create a new scorer under your project's functions. Set it to evaluate the full conversation output rather than individual turns. The scoring prompt should cover criteria like:

  1. Did the agent understand the customer's issue?
  2. Did the agent maintain context across turns?
  3. Did the agent resolve or appropriately escalate the issue?
  4. Did the agent avoid asking for information the customer already provided?
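The criteria above could be folded into a scoring prompt along these lines (a hypothetical template, not a Braintrust-provided prompt; `render_prompt` is an illustrative helper):

```python
# Hypothetical LLM-judge prompt covering the four criteria.
CONVERSATION_QUALITY_PROMPT = """\
You are evaluating a full customer-support conversation.

Conversation:
{conversation}

Answer each question with yes or no:
1. Did the agent understand the customer's issue?
2. Did the agent maintain context across turns?
3. Did the agent resolve or appropriately escalate the issue?
4. Did the agent avoid asking for information the customer already provided?

Final score: the number of "yes" answers divided by 4.
"""

def render_prompt(messages):
    """Flatten the message history into the prompt's transcript slot."""
    transcript = "\n".join(f'{m["role"]}: {m["content"]}' for m in messages)
    return CONVERSATION_QUALITY_PROMPT.format(conversation=transcript)
```

Asking for per-criterion yes/no answers and deriving the score tends to be more reliable than asking the judge for a single number directly.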

Running both scorers

To see where per-turn and trace-level scores disagree, write a script that fetches your production logs and runs both scorers:

```python
import braintrust

client = braintrust.init_logger(project="Customer Support Chatbot")

# Fetch recent production logs
logs = client.fetch_logs()

for log in logs:
    # Per-turn scores are already attached to individual spans;
    # the trace-level scorer evaluates the full conversation.
    conversation_score = conversation_quality_scorer(log)
    client.log(scores={"conversation_quality": conversation_score}, id=log["id"])
```

The interesting cases are where the scores disagree. A conversation where every turn scores well on brand alignment but scores poorly on conversation quality indicates a structural problem, such as the bot going in circles or failing to resolve the issue.

Identifying disagreements

When you run both scorers across your logs, look for these patterns:

  • High per-turn, low trace-level. The bot sounds good on each turn but never resolves the issue. This often means the system prompt lacks guidance on resolution steps.
  • Low per-turn, high trace-level. Individual responses are rough, but the conversation still gets resolved. This might be acceptable depending on your quality bar.
  • Both low. Clear failures that need attention.
  • Both high. The system is working well for these cases.

The disagreement cases are the most valuable because they reveal blind spots in your scoring strategy.
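One way to triage logs into the four patterns above, given an average per-turn score and a trace-level score per conversation (the threshold and bucket labels are illustrative choices, not part of any SDK):

```python
def classify(per_turn_avg: float, trace_score: float, threshold: float = 0.7) -> str:
    """Bucket a conversation by where the two scorers agree or disagree."""
    high_turn = per_turn_avg >= threshold
    high_trace = trace_score >= threshold
    if high_turn and not high_trace:
        return "structural problem"   # sounds good per turn, never resolves
    if not high_turn and high_trace:
        return "rough but resolved"   # acceptable depending on your quality bar
    if not high_turn and not high_trace:
        return "clear failure"
    return "working well"
```

Sorting your logs into these buckets and reading the two disagreement buckets first is usually the fastest way to find the blind spots.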

What's next

In the next module, you'll set up online scoring so that both per-turn and trace-level scorers run automatically on every new conversation, without manual scripts.

Further reading

Trace everything