What is online scoring?
Online scoring, also known as online evaluations, runs evaluations on traces as they are logged in production. This enables you to:
- Monitor quality continuously without manually running evaluations
- Catch regressions in production performance immediately
- Evaluate at scale across all your production traffic
- Get insights into real user interactions and edge cases
Set up online scoring
Online scoring is configured at the project level through the Configuration page. You can set up multiple scoring rules with different sampling rates and filters.
Create online scoring rules
Navigate to Configuration > Online scoring > + Create rule.
Configure online scoring rule parameters
For each online scoring rule, you can configure:
- Rule name: Unique identifier for the rule
- Description: Explanation of what the rule does and why it exists
- Scorers: Choose from autoevals or any custom scorers in the current project or in another project
- Sampling rate: Percentage of logs to evaluate (for example, 10% for high-volume applications)
- BTQL filter clause: Filter spans based on their data (input, output, metadata, etc.) using a BTQL filter clause. Only spans where the BTQL filter clause evaluates to true will be considered for scoring (see the example after this list).
- Note: The `!=` operator is not supported in this specific context (it fails silently). Use `IS NOT` instead.
- Apply to spans: Among spans that pass the BTQL filter (if any), choose which span types to actually score:
- Root spans toggle: Score root spans (top-level spans with no parent)
- Span names field: Score spans with specific names (comma-separated list)
- Note: If both options are enabled, spans matching EITHER criterion will be scored. If neither option is enabled, ALL spans that pass the BTQL filter will be scored.
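For instance, a hypothetical filter that only considers spans with a logged output from production traffic; the metadata.env field here is an assumption, so substitute whatever metadata your application actually logs:

```
output IS NOT NULL AND metadata.env = 'production'
```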
Test online scoring rules
Preview how your online scoring rule will perform by selecting Test rule at the bottom of the configuration dialog. This allows you to see sample results before enabling the rule.
Types of scorers for online evaluation
Online scoring supports the same types of scorers you can use in experiments. You can use pre-built scorers from the autoevals library or create custom code-based scorers written in TypeScript or Python that implement your specific evaluation logic. For more information on creating scorers, check out the scorers guide.
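As an illustration, here is a minimal sketch of a code-based scorer in TypeScript that checks whether a model’s output parses as JSON. The argument and return shapes follow common scorer conventions but are assumptions here; see the scorers guide for the exact interface.

```typescript
// A minimal code-based scorer: returns 1 if the output parses as JSON, else 0.
// The ({ output }) argument shape and { name, score } return shape are
// assumptions; consult the scorers guide for the exact interface.
function validJson({ output }: { output: string }) {
  try {
    JSON.parse(output);
    return { name: "valid_json", score: 1 };
  } catch {
    return { name: "valid_json", score: 0 };
  }
}
```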
View online scoring results
In logs view
Online scoring results appear automatically in your logs. Each scored span will show:
- Score value: The numerical result (0-1 or 0-100 depending on the scorer)
- Scoring span: A child span containing the evaluation details
- Scorer name: Which scorer generated the result

Best practices for online scoring
Choose your sampling rate based on application volume and criticality. High-volume applications should use lower sampling rates (1-10%) to manage costs, while low-volume or critical applications can afford higher rates (50-100%) for comprehensive coverage. For example, at 100,000 logs per day, a 5% sampling rate evaluates roughly 5,000 spans. Since online scoring runs asynchronously, it won’t impact your application’s latency, though LLM-as-a-judge scorers may have higher latency and costs than code-based alternatives. Online scoring works best as a complement to offline experimentation, helping you validate experiment results in production, monitor deployed changes for quality regressions, and identify new test cases from real user interactions.
Troubleshooting common issues
Low or inconsistent scores
- Review scorer logic: Ensure scoring criteria match expectations
- Check input/output format: Verify scorers receive expected data structure
- Test with known examples: Validate scorer behavior on controlled inputs (see the sketch after this list)
- Refine evaluation prompts: Make LLM-as-a-judge criteria more specific
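For a code-based scorer, this can be as simple as calling the function directly on hand-picked inputs, as with the hypothetical validJson scorer sketched earlier:

```typescript
// Sanity-check the scorer on controlled inputs before enabling the rule.
console.assert(validJson({ output: '{"ok": true}' }).score === 1, "should score 1 on valid JSON");
console.assert(validJson({ output: "not json" }).score === 0, "should score 0 on invalid JSON");
```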
Missing scores
- Check Apply to spans settings in online scoring rule: Ensure the root spans toggle and/or span names field target the correct span types
- Check BTQL filter clause in online scoring rule: Confirm your logs’ data (input, output, metadata) passes the BTQL filter clause. Also see the note above about unsupported BTQL operators
- Check sampling rate in online scoring rule: Low sampling may result in sparse scoring
- Check token permissions: Ensure your API key or service token has access to scorer projects and has ‘Read’ and ‘Update’ permissions on the project and project logs
- Check span data completeness at end time:
- Online scoring currently triggers when `span.end()` is called (or automatically when using `wrapTraced()` in TypeScript or the `@traced` decorator in Python)
- The online scoring rule’s BTQL filter clause evaluates only the data present at the moment when `span.end()` is called
- If a span is updated after calling `end()` (e.g., logging output after ending), the update won’t be evaluated by the BTQL filter clause. For example, if your filter requires `output IS NOT NULL` but output is logged after `span.end()`, the span won’t be scored (see the sketch below)
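To avoid this, log the output before ending the span. A minimal TypeScript sketch, assuming the SDK’s initLogger/startSpan API; the project name and model call are placeholders:

```typescript
import { initLogger } from "braintrust";

const logger = initLogger({ projectName: "my-project" }); // placeholder project name

// Stand-in for your actual LLM call.
async function callModel(input: string): Promise<string> {
  return `echo: ${input}`;
}

async function handleRequest(input: string): Promise<string> {
  const span = logger.startSpan({ name: "handleRequest" });
  span.log({ input });
  const output = await callModel(input);
  // Log the output BEFORE calling span.end(): the online scoring rule's BTQL
  // filter only sees data present when the span ends, so a filter like
  // `output IS NOT NULL` would never match if output were logged after end().
  span.log({ output });
  span.end();
  return output;
}
```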
Manual scoring
You can manually apply scorers to historical logs. When applied, scores will show up as additional spans within the log’s trace. There are three ways to manually score logs:
- Specific logs: Select the logs you’d like to score, then select Score to apply the chosen scorers
- Individual logs: Navigate into any individual log and use the Score button in the trace view to apply scorers to that specific log
- Bulk filtered logs: Use filters to narrow your view, then select Score past logs under Online scoring to apply scorers to the 50 most recent logs matching your filters
