Score logs
Scoring logs in Braintrust lets you evaluate the quality of your AI application in real time through online evaluations. Unlike experiments, which evaluate predefined datasets, online scoring automatically runs evaluations on your production logs as they are generated, providing continuous monitoring and quality assurance.
What is online scoring?
Online scoring, also known as online evaluations, runs evaluations on traces as they are logged in production. This enables you to:
- Monitor quality continuously without manually running evaluations
- Catch regressions in production performance immediately
- Evaluate at scale across all your production traffic
- Get insights into real user interactions and edge cases
Online scoring runs asynchronously in the background, so it doesn't impact your application's latency or performance.
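Any trace your application logs to the project becomes a candidate for scoring. Below is a minimal Python sketch, assuming the `braintrust` SDK is installed, `BRAINTRUST_API_KEY` is set in the environment, and a project named `chat-assistant` exists (the project name and function are illustrative).

```python
# Minimal sketch: log production traffic that online scoring rules can evaluate.
# Assumes BRAINTRUST_API_KEY is set; the project name is illustrative.
from braintrust import init_logger, traced

logger = init_logger(project="chat-assistant")

@traced
def answer_question(question: str) -> str:
    # Replace with your real application logic (LLM call, retrieval, etc.).
    return "Paris is the capital of France."

# Each call creates a trace in the project's logs; any online scoring rules
# configured for the project sample and score these traces asynchronously,
# outside your request path.
answer_question("What is the capital of France?")
```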
Set up online scoring
Online scoring is configured at the project level through the Configuration page. You can set up multiple scoring rules with different sampling rates and filters.
Create scoring rules
- Navigate to your project's Configuration page
- Scroll down to the Online scoring section
- Select Add rule to create a new online scoring rule
Configure rule parameters
For each scoring rule, you can configure:
- Scorer: Choose from autoevals or any custom scorers in your project
- Sampling rate: Percentage of logs to evaluate (for example, 10% for high-volume applications)
- Span filtering: Target specific spans (for example, a specific function call)
- Metadata filters: Evaluate only logs matching specific criteria
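Metadata filters can only match fields that your application actually logs. A minimal sketch of attaching metadata to a log with the Python SDK follows; the project name and metadata keys (`environment`, `plan`) are illustrative.

```python
# Sketch: attach metadata to logs so online scoring rules can filter on it.
# The metadata keys shown here (environment, plan) are illustrative.
from braintrust import init_logger

logger = init_logger(project="chat-assistant")

logger.log(
    input={"question": "What is the capital of France?"},
    output={"answer": "Paris"},
    metadata={"environment": "production", "plan": "enterprise"},
)
```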
Span targeting
By default, scoring applies to root spans, but you may want to target specific function spans for more granular evaluation.
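For example, splitting a handler into separately traced functions gives each step its own named child span that a rule can target. A Python sketch with illustrative function names:

```python
# Sketch: nested traced functions produce named child spans ("retrieve",
# "generate") under the root span, which span filters can target.
# Function names and logic are illustrative.
from braintrust import init_logger, traced

logger = init_logger(project="chat-assistant")

@traced
def retrieve(question: str) -> list[str]:
    return ["Paris is the capital of France."]  # stand-in for a real retriever

@traced
def generate(question: str, context: list[str]) -> str:
    return "Paris"  # stand-in for a real LLM call

@traced
def answer_question(question: str) -> str:
    context = retrieve(question)
    return generate(question, context)

answer_question("What is the capital of France?")
```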
Types of scorers for online evaluation
Online scoring supports the same types of scorers you can use in experiments. You can use pre-built scorers from the autoevals library or create custom code-based scorers written in TypeScript or Python that implement your specific evaluation logic. For more information on creating scorers, check out the scorers guide.
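As a rough sketch, a code-based scorer is just a function of the logged fields that returns a value between 0 and 1, and autoevals scorers follow the same shape. The example below runs one of each locally for illustration; the custom scorer and test values are hypothetical, and the Factuality call assumes an OpenAI API key is available in the environment.

```python
# Sketch: the two kinds of scorers an online scoring rule can reference.
# Running them locally like this is only for illustration; in online scoring,
# Braintrust invokes the scorers you select on sampled logs for you.
from autoevals import Factuality


# A custom code-based scorer: any function of the logged fields that
# returns a value between 0 and 1. Name and logic are hypothetical.
def contains_citation(input, output, expected=None):
    return 1.0 if "http" in str(output) else 0.0


# A prebuilt LLM-as-a-judge scorer from the autoevals library
# (requires an OpenAI API key in the environment).
result = Factuality()(
    input="What is the capital of France?",
    output="Paris is the capital of France.",
    expected="Paris",
)
print(result.score)  # numeric score between 0 and 1
print(contains_citation(None, "See https://example.com"))
```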
View online scoring results
In logs view
Online scoring results appear automatically in your logs. Each scored span will show:
- Score value: The numerical result (0-1 or 0-100 depending on scorer)
- Scoring span: A child span containing the evaluation details
- Scorer name: Which scorer generated the result
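You can also read scores off logs programmatically. The sketch below uses the REST API's project logs fetch endpoint; treat the exact path, parameters, and response fields as assumptions to verify against the API reference, and the project ID is a placeholder.

```python
# Sketch: read scores off recent logs via the REST API.
# Assumptions to verify against the API reference: the
# /v1/project_logs/{project_id}/fetch endpoint, the "limit" parameter, and the
# "events"/"scores" fields in the response. PROJECT_ID is a placeholder.
import os
import requests

PROJECT_ID = "YOUR_PROJECT_ID"
resp = requests.get(
    f"https://api.braintrust.dev/v1/project_logs/{PROJECT_ID}/fetch",
    headers={"Authorization": f"Bearer {os.environ['BRAINTRUST_API_KEY']}"},
    params={"limit": 10},
)
resp.raise_for_status()

for event in resp.json().get("events", []):
    # "scores" maps scorer names to values for spans that have been scored.
    print(event.get("id"), event.get("scores"))
```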
Best practices for online scoring
Choose your sampling rate based on application volume and criticality. High-volume applications should use lower sampling rates (1-10%) to manage costs, while low-volume or critical applications can afford higher rates (50-100%) for comprehensive coverage. Since online scoring runs asynchronously, it won't impact your application's latency, though LLM-as-a-judge scorers may have higher latency and costs than code-based alternatives.
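To make sampling rates concrete, the expected number of scorer runs is simply traffic multiplied by the sampling rate. A quick back-of-the-envelope sketch with illustrative volumes:

```python
# Back-of-the-envelope: expected scorer runs per day at a given sampling rate.
# The traffic figure is illustrative.
logs_per_day = 50_000
for sampling_rate in (0.01, 0.10, 0.50):
    runs = logs_per_day * sampling_rate
    print(f"{sampling_rate:.0%} sampling -> ~{runs:,.0f} scorer runs/day")
```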
Online scoring works best as a complement to offline experimentation, helping you validate experiment results in production, monitor deployed changes for quality regressions, and identify new test cases from real user interactions.
Troubleshooting common issues
Low or inconsistent scores
- Review scorer logic: Ensure scoring criteria match expectations
- Check input/output format: Verify scorers receive expected data structure
- Test with known examples: Validate scorer behavior on controlled inputs
- Refine evaluation prompts: Make LLM-as-a-judge criteria more specific
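One way to test with known examples, as suggested in the list above, is to run the scorer directly on a few controlled cases where you already know roughly what it should return. A minimal sketch using an autoevals scorer; the cases are illustrative.

```python
# Sketch: sanity-check a scorer against controlled inputs with known answers
# before relying on its online results. Cases and expected scores are illustrative.
from autoevals import Levenshtein

scorer = Levenshtein()

cases = [
    {"output": "Paris", "expected": "Paris"},    # should score 1.0
    {"output": "London", "expected": "Paris"},   # should score well below 1.0
]

for case in cases:
    result = scorer(output=case["output"], expected=case["expected"])
    print(case, "->", result.score)
```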
Missing scores
- Check span filtering: Ensure rules target the correct span types
- Verify metadata filters: Confirm logs match configured criteria
- Review sampling rate: Low sampling may result in sparse scoring