Online scoring evaluates production traces automatically as they’re logged, running evaluations asynchronously in the background to provide continuous quality monitoring without affecting your application’s latency or performance. This enables you to:
  • Monitor quality continuously across all production traffic
  • Catch regressions immediately when they occur
  • Evaluate at scale without manual intervention
  • Get insights into real user interactions and edge cases
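
For context, here is a minimal sketch of how production traces might be logged with the Braintrust Python SDK; the project name, span name, and span contents are placeholders. Any span logged this way becomes eligible for online scoring once a rule is configured for the project.

```python
# A minimal sketch, assuming the Braintrust Python SDK and a placeholder project name.
from braintrust import init_logger

logger = init_logger(project="my-production-app")  # placeholder project name

def answer_question(question: str) -> str:
    # The span ends when the `with` block exits, at which point online scoring
    # evaluates it asynchronously without adding latency to this request.
    with logger.start_span(name="answer_question") as span:
        response = "..."  # call your model or pipeline here
        span.log(input=question, output=response)
        return response
```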

Create scoring rules

Online scoring rules are defined at the project level and specify which scorers to run, how often, and on which logs. Once configured, these rules automatically evaluate production traces as they arrive.
In the Braintrust UI, you can create a scoring rule from project settings, while setting up a scorer, or while testing a scorer.
  • Project settings: Go to Settings > Project > Online scoring and click Add rule. See Manage projects for more details.
  • Scorer setup: When creating or editing a scorer, click Online scoring and either select an existing scoring rule or create a new rule. This workflow allows you to configure both the scorer and its scoring rules together. See Write scorers for more details.
  • Scorer testing: When testing a scorer with logs in the Run section, filter logs to find relevant examples, then click Online scoring to create a new online scoring rule with filters automatically prepopulated from your current log filters. This enables rapid iteration from logs to scoring rules. See Test with logs for more details.
Select Test rule to preview how your rule will perform before enabling it.

Rule parameters

Configure each rule with:
  • Rule name: Unique identifier.
  • Description: Explanation of the rule’s purpose.
  • Project: Select which project the rule belongs to. When creating from a scorer page, you can choose any project.
  • Scorers: Choose from autoevals or custom scorers from the current project or any other project in your organization. When creating from a scorer page, the current scorer is automatically selected.
  • Sampling rate: Percentage of logs to evaluate (e.g., 10% for high-volume apps).
  • SQL filter: Filter spans based on input, output, metadata, etc. using a SQL filter clause. Only spans matching the filter are scored. When creating a rule from the logs browser, filters are automatically prepopulated (see the sketch after this list).
    The != operator is not supported in SQL filters for online scoring. Use IS NOT instead.
  • Apply to spans: Choose which span types to score:
    • Root spans (top-level spans with no parent)
    • Specific span names (comma-separated list)
    If both options are enabled, spans matching either criterion are scored. If neither is enabled, all spans passing the SQL filter are scored.
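
As a concrete illustration, the sketch below logs a span whose fields a rule could target. It assumes the Braintrust Python SDK; the span name, metadata keys, and filter text are hypothetical, not required values.

```python
# Sketch of a span whose fields an online scoring rule could target.
# The span name, metadata keys, and filter text are illustrative only.
from braintrust import init_logger

logger = init_logger(project="my-production-app")

with logger.start_span(name="chat-completion") as span:
    span.log(
        input={"question": "How do I reset my password?"},
        output={"answer": "Go to Settings > Security and choose Reset password."},
        metadata={"environment": "production", "model": "gpt-4o"},
    )

# A rule that scores only completed production chat spans might combine:
#   SQL filter:     metadata.environment = 'production' AND output IS NOT NULL
#   Apply to spans: specific span names -> chat-completion
```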

View scoring results

Scores appear automatically in your logs. Each scored span shows:
  • Score value (0-1 or 0-100, depending on the scorer)
  • Scoring span containing evaluation details
  • Scorer name that generated the result

Score manually

Apply scorers to historical logs in three ways:
  • Specific logs: Select logs and use Score to apply chosen scorers
  • Individual logs: Open any log and use Score in the trace view
  • Bulk filtered logs: Filter logs, then use Score past logs under Online scoring to score the 50 most recent matching logs

Best practices

  • Choose sampling rates wisely: High-volume applications should use lower rates (1-10%) to manage costs, while low-volume or critical applications can use higher rates (50-100%) for comprehensive coverage. For example, at 100,000 logs per day, a 5% sampling rate evaluates roughly 5,000 logs per day.
  • Complement offline evaluation: Use online scoring to validate experiment results in production, monitor deployed changes, and identify new test cases from real interactions.
  • Consider scorer costs: LLM-as-a-judge scorers have higher latency and cost than code-based alternatives. Factor this into your sampling rate decisions.

Troubleshoot issues

Low or inconsistent scores

  • Review scorer logic to ensure criteria match expectations.
  • Verify scorers receive the expected data structure.
  • Test scorer behavior on controlled inputs (see the sketch after this list).
  • Make LLM-as-a-judge criteria more specific.
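
One way to exercise a scorer on controlled inputs is to call it directly. This sketch assumes the autoevals Factuality scorer; the question, answer, and expected value are made up, so substitute your own scorer and examples.

```python
# Sketch: run a scorer by itself on hand-picked inputs before relying on it
# in an online scoring rule.
from autoevals import Factuality

result = Factuality()(
    input="What color is the sky on a clear day?",
    output="On a clear day the sky is blue.",
    expected="Blue.",
)
print(result.score)     # numeric score, typically in the 0-1 range
print(result.metadata)  # rationale and other evaluation details, if provided
```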

Missing scores

  • Check Apply to spans settings to ensure correct span types are targeted (root spans or specific span names).
  • Verify logs pass the SQL filter clause. Confirm your logs’ data (input, output, metadata) matches the filter criteria.
  • Confirm sampling rate isn’t too low for your traffic volume.
  • Ensure API key or service token has proper permissions (Read and Update on project and project logs). If using scorers from other projects, ensure permissions on those projects as well.
  • Verify span data is complete when span.end() is called:
    • Online scoring triggers when span.end() is called (or automatically when using wrapTraced() in TypeScript or the @traced decorator in Python).
    • The SQL filter clause evaluates only the data present at the moment span.end() is called.
    • If a span is updated after calling end() (e.g., logging output after ending), the update won't be evaluated by the filter. For example, if your filter requires output IS NOT NULL but output is logged after span.end(), the span won't be scored.
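
The sketch below illustrates this timing pitfall with the Braintrust Python SDK; the span names and the filter mentioned in the comments are illustrative.

```python
# A sketch of the span.end() timing pitfall; names and filter are illustrative.
from braintrust import init_logger

logger = init_logger(project="my-production-app")

# Scored: output is present when the span ends, so a filter such as
# `output IS NOT NULL` matches at evaluation time.
with logger.start_span(name="chat-completion") as span:
    span.log(input="Hi", output="Hello!")

# Not scored by that filter: the span has already ended when output arrives,
# so the SQL filter never sees it.
span = logger.start_span(name="chat-completion")
span.log(input="Hi")
span.end()                 # online scoring triggers here, with output missing
span.log(output="Hello!")  # logged after end(); not evaluated by the filter
```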

Next steps