How online scoring works
Online scoring runs evaluations asynchronously in the background on your production logs. This enables you to:
- Monitor quality continuously across all production traffic
- Catch regressions immediately when they occur
- Evaluate at scale without manual intervention
- Get insights into real user interactions and edge cases
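Online scoring evaluates whatever your application already logs, so the only prerequisite is that production traffic is logged to the project. Below is a minimal sketch of production logging with the Python SDK; the project name, metadata fields, and the answer_question function are illustrative assumptions rather than anything required by online scoring.

```python
import braintrust

# Assumed project name; online scoring rules are configured per project.
logger = braintrust.init_logger(project="my-production-app")

@braintrust.traced  # creates a root span for each call
def answer_question(question: str) -> str:
    answer = f"Echo: {question}"  # stand-in for your real model call
    # Log input, output, and metadata; scoring rules can filter on these fields.
    braintrust.current_span().log(
        input=question,
        output=answer,
        metadata={"env": "production"},
    )
    return answer

answer_question("What is online scoring?")
```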
Configure scoring rules
Set up online scoring in your project’s Configuration page under Online scoring.
Create a rule
Navigate to Configuration > Online scoring > + Create rule.
Rule parameters
Configure each rule with:
- Rule name: Unique identifier.
- Description: Explanation of the rule’s purpose.
- Scorers: Choose from autoevals or custom scorers in your project (see the sketch after this list).
- Sampling rate: Percentage of logs to evaluate (e.g., 10% for high-volume apps).
- SQL filter: Filter spans based on input, output, metadata, etc. Only spans matching the filter are scored.
- Apply to spans: Choose which span types to score:
  - Root spans (top-level spans with no parent)
  - Specific span names (comma-separated list)
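For the Scorers parameter, a rule can reference custom scorers that already exist in your project. As a rough sketch of the code-based case, assuming the common convention of a scorer function that receives the span’s output plus an optional expected value and returns a score between 0 and 1 (the exact_match name is illustrative):

```python
def exact_match(output, expected=None, **kwargs):
    # Illustrative code-based scorer: 1.0 if the output matches the expected value, else 0.0.
    if expected is None:
        return 0.0
    return 1.0 if str(output).strip() == str(expected).strip() else 0.0
```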
Test before enabling
Select Test rule to preview how your rule will perform before enabling it.
View scoring results
Scores appear automatically in your logs. Each scored span shows:
- Score value (0-1 or 0-100, depending on the scorer)
- Scoring span containing evaluation details
- Scorer name that generated the result

Score manually
Apply scorers to historical logs in three ways:
- Specific logs: Select logs and use Score to apply chosen scorers
- Individual logs: Open any log and use Score in the trace view
- Bulk filtered logs: Filter logs, then use Score past logs under Online scoring to score the 50 most recent matching logs

Best practices
Choose sampling rates wisely: High-volume applications should use lower rates (1-10%) to manage costs. Low-volume or critical applications can use higher rates (50-100%) for comprehensive coverage.
Complement offline evaluation: Use online scoring to validate experiment results in production, monitor deployed changes, and identify new test cases from real interactions.
Consider scorer costs: LLM-as-a-judge scorers have higher latency and costs than code-based alternatives. Factor this into your sampling rate decisions.
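To make the cost trade-off concrete, the autoevals library includes both kinds of scorer: Levenshtein is a cheap code-based string comparison, while Factuality makes an LLM call per evaluation (and needs an LLM provider key configured). The example values below are illustrative:

```python
from autoevals.llm import Factuality
from autoevals.string import Levenshtein

# Code-based scorer: fast and effectively free, so it tolerates high sampling rates.
cheap = Levenshtein()("Paris", "Paris, France")  # (output, expected)
print(cheap.score)

# LLM-as-a-judge scorer: one model call per evaluated span, so higher latency and cost.
judged = Factuality()(
    "Paris",                            # output
    "The capital of France is Paris.",  # expected
    input="What is the capital of France?",
)
print(judged.score)
```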
Troubleshoot issues
Low or inconsistent scores
- Review scorer logic to ensure criteria match expectations.
- Verify scorers receive the expected data structure.
- Test scorer behavior on controlled inputs.
- Make LLM-as-a-judge criteria more specific.
Missing scores
- Check Apply to spans settings to ensure correct span types are targeted.
- Verify logs pass the SQL filter clause. The != operator is not supported in SQL filters for online scoring; use IS NOT instead.
- Confirm the sampling rate isn’t too low.
- Ensure API key has proper permissions (Read and Update on project and project logs).
- Verify span data is complete when span.end() is called - data added after ending won’t trigger scoring (see the sketch below).
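In the Python SDK, one way to guarantee data is present before the span ends is to log inside start_span’s context manager, which ends the span on exit. A minimal sketch, with illustrative project and span names:

```python
import braintrust

logger = braintrust.init_logger(project="my-production-app")  # assumed project name

# Log inside the context manager so input/output exist before the span ends;
# anything logged after the span ends won't trigger online scoring.
with logger.start_span(name="answer_question") as span:
    span.log(
        input="What is online scoring?",
        output="It runs scorers asynchronously on production logs.",
    )
```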
Next steps
- Create dashboards to monitor score trends
- Build datasets from scored production traces
- Run experiments to validate scoring criteria
- Learn about creating scorers