Online scoring

Set up automation rules so brand alignment and conversation quality scores run automatically on every new production log.

All the assets for this module are available at braintrustdata/eval-101-course/module-12.

From manual scripts to automation

In the previous module, you wrote a script that manually fetched logs, ran scorers, and wrote results back. That works for one-time analysis, but in production you need scoring to happen automatically on every new conversation. That is online scoring.

Online scoring runs your scorers against production logs as they arrive, so you get continuous quality monitoring without manual intervention.

Setting up automation rules

In the Braintrust UI, navigate to your project's logs, select Automations, and create a new automation rule.

Each automation rule defines:

  • Which scorer to run. Select one of your existing scorers (for example, Brand Alignment or Conversation Quality).
  • The scope. Span scope scores each individual span as it is logged; trace scope scores the full conversation once the trace completes.
  • Filters. Optionally restrict which logs get scored based on metadata or other fields.
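Conceptually, an automation rule is just a scorer plus a scope plus an optional filter. As a rough mental model only (this is not the Braintrust implementation; all names and data shapes here are illustrative), a span-scoped rule engine could be sketched like this:

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Illustrative types: a "span" is a dict with output/metadata fields.
Span = dict
Scorer = Callable[[Span], float]

@dataclass
class AutomationRule:
    scorer: Scorer                                    # which scorer to run
    scope: str                                        # "span" or "trace"
    filter: Optional[Callable[[Span], bool]] = None   # optional restriction

    def matches(self, span: Span) -> bool:
        return self.filter is None or self.filter(span)

def apply_rules(rules: list[AutomationRule], spans: list[Span]) -> None:
    """Run every matching span-scoped rule against each new span."""
    for rule in rules:
        if rule.scope != "span":
            continue  # trace-scoped rules wait for the full trace
        for span in spans:
            if rule.matches(span):
                span.setdefault("scores", {})[rule.scorer.__name__] = rule.scorer(span)

# Toy brand-alignment scorer: checks for a friendly sign-off phrase.
def brand_alignment(span: Span) -> float:
    return 1.0 if "happy to help" in span["output"].lower() else 0.0

rule = AutomationRule(
    scorer=brand_alignment,
    scope="span",
    filter=lambda s: s["metadata"].get("turn_number") is not None,
)

spans = [
    {"output": "Happy to help!", "metadata": {"turn_number": 1}},
    {"output": "No.", "metadata": {}},  # filtered out: no turn_number
]
apply_rules([rule], spans)
```

The filter here mirrors the metadata.turn_number is not null condition used later in this module: only spans representing actual conversation turns get scored.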

Configuring per-turn scoring

For per-turn brand alignment scoring, create an automation rule with these settings:

  1. Select your Brand Alignment scorer.
  2. Set the scope to span.
  3. Add a filter on metadata, for example metadata.turn_number is not null, to ensure you only score spans that represent individual conversation turns.

This scores each turn independently as it gets logged.

Configuring trace-level scoring

For trace-level conversation quality scoring, create a second automation rule:

  1. Select your Conversation Quality scorer.
  2. Set the scope to trace.

This waits for the full conversation to complete, then evaluates it as a unit. Trace-level scoring catches problems that per-turn scoring misses, such as unresolved issues, repeated questions, or context loss.
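To see why trace scope matters, here is a toy illustration (not the actual Conversation Quality scorer) of a check that only works with the whole conversation in hand: flagging a user who has to repeat the same question, which a per-turn scorer looking at one span at a time could never notice.

```python
def repeated_question_rate(turns: list[dict]) -> float:
    """Toy trace-level check: fraction of user questions that are repeats.

    Each turn is a dict like {"role": "user" | "assistant", "content": "..."}.
    A repeated question suggests the bot failed to resolve it the first time.
    """
    seen: set[str] = set()
    questions = repeats = 0
    for turn in turns:
        if turn["role"] != "user" or "?" not in turn["content"]:
            continue
        questions += 1
        key = turn["content"].strip().lower()
        if key in seen:
            repeats += 1
        seen.add(key)
    return repeats / questions if questions else 0.0

conversation = [
    {"role": "user", "content": "How do I reset my password?"},
    {"role": "assistant", "content": "Click 'Forgot password' on the login page."},
    {"role": "user", "content": "How do I reset my password?"},  # asked again
    {"role": "assistant", "content": "As above: use 'Forgot password'."},
]
```

Each individual assistant turn here might score well on brand alignment, yet the repeated question signals an unresolved issue that only the full trace reveals.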

Seeing online scoring in action

After setting up both automation rules, generate some production logs. You can run your multi-turn chat application with a batch of test conversations to verify that scoring is working.

Open the Logs tab and expand a trace. You should see scores attached to both individual spans (brand alignment) and the root trace (conversation quality). The scores appear automatically, with no manual script needed.

Finding disagreements between scorers

The most interesting results are conversations where per-turn and trace-level scores disagree. For example:

  • A conversation where every individual turn scores 100% on brand alignment, but the conversation quality score is 40%. This means the bot sounded good on each response but failed to resolve the customer's issue.
  • A conversation with mixed per-turn scores but a high conversation quality score. The bot was imperfect on individual turns but still got the job done.

These disagreements reveal the kinds of issues each scorer is designed to catch and help you refine both your application and your scoring strategy.

What's next

In the next module, you'll learn how to analyze production logs at scale using topics, which automatically cluster your conversations into named categories.

Further reading

Trace everything