Score logs

Scoring logs in Braintrust allows you to evaluate the quality of your AI application's performance in real time through online evaluations. Unlike experiments, which evaluate pre-defined datasets, online scoring automatically runs evaluations on your production logs as they are generated, providing continuous monitoring and quality assurance.
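
Online scoring operates on whatever your application logs to Braintrust, so the starting point is instrumented production code. The sketch below shows one minimal way to produce such logs with the Python SDK; the project name, model, and prompt are illustrative, and it assumes BRAINTRUST_API_KEY and OPENAI_API_KEY are set in your environment.

    import braintrust
    from openai import OpenAI

    # Traces logged here become the production logs that online scoring rules evaluate.
    logger = braintrust.init_logger(project="my-app")  # illustrative project name
    client = braintrust.wrap_openai(OpenAI())          # auto-logs LLM calls as spans

    @braintrust.traced
    def answer(question: str) -> str:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": question}],
        )
        return response.choices[0].message.content

    answer("What time zone is UTC+0?")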

Manual scoring

You can manually apply existing scorers from your project to past logs. There are three ways to manually score logs:

  • Specific logs: Select the logs you'd like to score, then select Score to apply the chosen scorers
  • Individual logs: Navigate into any log and use the Score button in the trace view to apply scorers to that specific log
  • Bulk filtered logs: Use filters to narrow your view, then select Score past logs under Online scoring to apply scorers to the most recent 50 logs matching your filters

Apply score
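
The options above apply scorers through the UI. If you already know a score when you log an event (for example, from a deterministic check in your application code), the SDK also accepts a scores field at logging time. A minimal sketch, with an illustrative project name and score key:

    import braintrust

    logger = braintrust.init_logger(project="my-app")  # illustrative project name

    # Scores are numbers between 0 and 1, keyed by scorer name.
    logger.log(
        input={"question": "What is 2 + 2?"},
        output="4",
        expected="4",
        scores={"exact_match": 1.0},
    )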

What is online scoring?

Online scoring, also known as online evaluation, runs scorers on traces as they are logged in production. This enables you to:

  • Monitor quality continuously without manually running evaluations
  • Catch regressions in production performance immediately
  • Evaluate at scale across all your production traffic
  • Get insights into real user interactions and edge cases

Online scoring runs asynchronously in the background, so it doesn't impact your application's latency or performance.

Set up online scoring

Online scoring is configured at the project level through the Configuration page. You can set up multiple scoring rules with different sampling rates and filters.

Create scoring rules

  1. Navigate to your project's Configuration page
  2. Scroll down to the Online scoring section
  3. Select Add rule to create a new online scoring rule

Configure score

Configure rule parameters

For each scoring rule, you can configure:

  • Scorer: Choose from autoevals or any custom scorers in your project
  • Sampling rate: Percentage of logs to evaluate (for example, 10% for high-volume applications)
  • BTQL filtering: Apply a BTQL filter clause to include only logs that match your filter criteria (see the example after this list)
  • Span filtering: Target specific spans (for example, a particular function call) instead of the default root spans, for more granular evaluation
  • Metadata filters: Evaluate only logs matching specific criteria
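
For example, a BTQL filter clause reads like a SQL WHERE expression over the fields of a log, so a clause such as metadata.environment = 'production' AND metadata.model = 'gpt-4o-mini' restricts scoring to logs that match it. The metadata keys here are hypothetical and depend on what your application actually logs.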

Test scoring rules

Preview how your scoring rule will perform by selecting Test rule at the bottom of the configuration dialog. This allows you to see sample results before enabling the rule.

Types of scorers for online evaluation

Online scoring supports the same types of scorers you can use in experiments. You can use pre-built scorers from the autoevals library or create custom code-based scorers written in TypeScript or Python that implement your specific evaluation logic. For more information on creating scorers, check out the scorers guide.
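
As a concrete illustration, here is the general shape of both options in Python: a pre-built autoevals scorer and a custom code-based scorer. The function name and matching logic are hypothetical; see the scorers guide for how to add a custom scorer to your project so it can be selected in online scoring rules.

    from autoevals import Factuality

    # A pre-built LLM-as-a-judge scorer from autoevals (requires an OpenAI API key).
    factuality = Factuality()

    # A hypothetical custom code-based scorer: it receives input, output, and
    # expected, and returns a score between 0 and 1 (or None to skip).
    def exact_match(input, output, expected):
        if expected is None:
            return None
        return 1.0 if str(output).strip().lower() == str(expected).strip().lower() else 0.0

    exact_match(input=None, output="Paris", expected="paris")  # -> 1.0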

View online scoring results

In logs view

Online scoring results appear automatically in your logs. Each scored span will show:

  • Score value: The numerical result (0-1 or 0-100 depending on scorer)
  • Scoring span: A child span containing the evaluation details
  • Scorer name: Which scorer generated the result

Scoring span

Best practices for online scoring

Choose your sampling rate based on application volume and criticality. High-volume applications should use lower sampling rates (1-10%) to manage costs, while low-volume or critical applications can afford higher rates (50-100%) for comprehensive coverage. Since online scoring runs asynchronously, it won't impact your application's latency, though LLM-as-a-judge scorers may have higher latency and costs than code-based alternatives.
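
A quick back-of-the-envelope calculation helps when picking a rate; the volume and per-call cost figures below are hypothetical placeholders, not Braintrust defaults or pricing.

    daily_logs = 200_000            # assumed production volume
    sampling_rate = 0.05            # 5% of matching logs get scored
    cost_per_judge_call = 0.002     # assumed average cost of one LLM-as-a-judge call (USD)

    scored_per_day = daily_logs * sampling_rate
    print(f"~{scored_per_day:,.0f} scored logs/day, "
          f"~${scored_per_day * cost_per_judge_call:,.2f}/day in judge calls")
    # -> ~10,000 scored logs/day, ~$20.00/day in judge calls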

Online scoring works best as a complement to offline experimentation, helping you validate experiment results in production, monitor deployed changes for quality regressions, and identify new test cases from real user interactions.

Troubleshooting common issues

Low or inconsistent scores

  • Review scorer logic: Ensure scoring criteria match expectations
  • Check input/output format: Verify scorers receive expected data structure
  • Test with known examples: Validate scorer behavior on controlled inputs (see the sketch after this list)
  • Refine evaluation prompts: Make LLM-as-a-judge criteria more specific
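
For the "test with known examples" step, you can call a scorer directly on a controlled input before relying on it in production. A minimal sketch using autoevals' Factuality, which assumes an OpenAI API key is available in your environment:

    from autoevals import Factuality

    # A controlled example where you know roughly what the score should be.
    result = Factuality()(
        input="Which country has the highest population?",
        output="The People's Republic of China",
        expected="China",
    )
    print(result.score)     # expect a high score for this example
    print(result.metadata)  # details from the judge, such as its rationale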

Missing scores

  • Check span filtering: Ensure rules target the correct span types
  • Verify metadata filters: Confirm logs match configured criteria
  • Review sampling rate: Low sampling may result in sparse scoring
  • Verify token permissions: Ensure your API key or service token has access to scorer projects and has 'Read' and 'Update' permissions on the project and project logs