Learn everything you need to know about evals by building and monitoring a customer support chatbot from scratch.
Set up automation rules so brand alignment and conversation quality scores run automatically on every new production log.
All the assets for this module are available at braintrustdata/eval-101-course/module-12.
In the previous module, you wrote a script that manually fetched logs, ran scorers, and wrote results back. This works for one-time analysis, but you need scoring to happen automatically on every new conversation. That is online scoring.
Online scoring runs your scorers against production logs as they arrive, so you get continuous quality monitoring without manual intervention.
In the Braintrust UI, navigate to your project's logs, select Automations, and create a new automation rule.
Each automation rule defines:
For per-turn brand alignment scoring, create an automation rule with these settings:
metadata.turn_number is not null, to ensure you only score spans that represent individual conversation turns.This scores each turn independently as it gets logged.
For trace-level conversation quality scoring, create a second automation rule:
This waits for the full conversation to complete, then evaluates it as a unit. Trace-level scoring catches problems that per-turn scoring misses, such as unresolved issues, repeated questions, or context loss.
After setting up both automation rules, generate some production logs. You can run your multi-turn chat application with a batch of test conversations to verify that scoring is working.
Open the Logs tab and expand a trace. You should see scores attached to both individual spans (brand alignment) and the root trace (conversation quality). The scores appear automatically, with no manual script needed.
The most interesting results are conversations where per-turn and trace-level scores disagree. For example:
These disagreements reveal the kinds of issues each scorer is designed to catch and help you refine both your application and your scoring strategy.
In the next module, you'll learn how to analyze production logs at scale using topics, which automatically cluster your conversations into named categories.