Online scoring evaluates production traces automatically as they’re logged, providing continuous quality monitoring without impacting application performance.

How online scoring works

Online scoring runs evaluations asynchronously in the background on your production logs. This enables you to:
  • Monitor quality continuously across all production traffic
  • Catch regressions as soon as they occur
  • Evaluate at scale without manual intervention
  • Get insights into real user interactions and edge cases
Scoring happens asynchronously, so it doesn’t affect your application’s latency or performance.
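For example, with the Python SDK your application code only logs the trace; any online scoring rules run server-side afterward. A minimal sketch (the project name and function are illustrative, and it assumes the SDK's @traced decorator, which logs a function's arguments and return value by default):

```python
import braintrust
from braintrust import traced

braintrust.init_logger(project="my-project")  # illustrative project name

@traced  # logs one span per call; online scoring evaluates it in the background
def answer_question(question: str) -> str:
    # Stand-in for your real LLM call
    return "Refunds are available within 30 days of purchase."

answer_question("What is the refund policy?")
```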

Configure scoring rules

Set up online scoring in your project’s Configuration page under Online scoring.

Create a rule

Navigate to Configuration > Online scoring > + Create rule.

Rule parameters

Configure each rule with:
  • Rule name: Unique identifier.
  • Description: Explanation of the rule’s purpose.
  • Scorers: Choose from autoevals or custom scorers in your project.
  • Sampling rate: Percentage of logs to evaluate (e.g., 10% for high-volume apps).
  • SQL filter: Filter spans based on input, output, metadata, etc. Only spans matching the filter are scored.
  • Apply to spans: Choose which span types to score:
    • Root spans (top-level spans with no parent)
    • Specific span names (comma-separated list)
    If both options are enabled, spans matching either criterion are scored. If neither is enabled, all spans passing the SQL filter are scored. The sketch after this list shows how spans get the names these settings match on.
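To make Specific span names rules useful, give your spans stable names when logging. A minimal sketch with the Python SDK (project and span names are illustrative):

```python
import braintrust

logger = braintrust.init_logger(project="my-project")  # illustrative project name

# Root span: matched by rules with "Root spans" enabled.
with logger.start_span(name="handle_request") as root:
    root.log(input="What is the refund policy?", metadata={"environment": "prod"})

    # Child span: matched by a rule whose "Specific span names" list
    # includes generate_answer.
    with root.start_span(name="generate_answer") as child:
        child.log(input="...", output="Refunds are available within 30 days.")
```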

Test before enabling

Select Test rule to preview how your rule will perform before enabling it.

View scoring results

Scores appear automatically in your logs. Each scored span shows:
  • Score value (0-1 or 0-100 depending on scorer)
  • Scoring span containing evaluation details
  • Name of the scorer that generated the result
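Scores are also stored on the log events themselves, so you can pull them programmatically. A hypothetical sketch against the REST API (the endpoint path, placeholder project ID, and response fields are assumptions; check the API reference):

```python
import os

import requests

# Hypothetical sketch: list recent log events and their scores via the REST API.
resp = requests.get(
    "https://api.braintrust.dev/v1/project_logs/<project_id>/fetch",
    headers={"Authorization": f"Bearer {os.environ['BRAINTRUST_API_KEY']}"},
    params={"limit": 10},
)
resp.raise_for_status()
for event in resp.json().get("events", []):
    name = event.get("span_attributes", {}).get("name")
    print(name, event.get("scores"))
```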

Score manually

Apply scorers to historical logs in three ways:
  • Specific logs: Select logs and use Score to apply chosen scorers
  • Individual logs: Open any log and use Score in the trace view
  • Bulk filtered logs: Filter logs, then use Score past logs under Online scoring to score the 50 most recent matching logs

Best practices

  • Choose sampling rates wisely: High-volume applications should use lower rates (1-10%) to manage costs; low-volume or critical applications can use higher rates (50-100%) for comprehensive coverage.
  • Complement offline evaluation: Use online scoring to validate experiment results in production, monitor deployed changes, and identify new test cases from real interactions.
  • Consider scorer costs: LLM-as-a-judge scorers have higher latency and costs than code-based alternatives (see the sketch after this list). Factor this into your sampling rate decisions.
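To illustrate the cost difference, a code-based scorer is just a deterministic function, so it runs in microseconds and can tolerate a much higher sampling rate than an LLM judge. A hypothetical sketch (the function and its criterion are invented for the example):

```python
# Hypothetical code-based scorer: deterministic and cheap compared to an
# LLM-as-a-judge, so it can run at a high sampling rate.
def contains_citation(output: str) -> float:
    """Return 1.0 if the response includes a bracketed citation, else 0.0."""
    return 1.0 if "[" in output and "]" in output else 0.0

print(contains_citation("Refunds take 30 days [policy-doc]."))  # 1.0
```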

Troubleshoot issues

Low or inconsistent scores

  • Review scorer logic to ensure criteria match expectations.
  • Verify scorers receive the expected data structure.
  • Test scorer behavior on controlled inputs (see the sketch after this list).
  • Make LLM-as-a-judge criteria more specific.
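For example, you can exercise an autoevals scorer locally on a fixed input before relying on it in a rule. A minimal sketch (the LLM-as-a-judge call requires an OpenAI API key; the sample values and scorer choice are illustrative):

```python
from autoevals.llm import Factuality

# Run the scorer on a controlled input to verify it behaves as expected.
evaluator = Factuality()
result = evaluator(
    output="People's Republic of China",
    expected="China",
    input="Which country has the highest population?",
)
print(result.score)     # numeric score between 0 and 1
print(result.metadata)  # includes the judge's rationale
```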

Missing scores

  • Check Apply to spans settings to ensure correct span types are targeted.
  • Verify logs pass the SQL filter clause.
    The != operator is not supported in SQL filters for online scoring. Use IS NOT instead.
  • Confirm sampling rate isn’t too low.
  • Ensure API key has proper permissions (Read and Update on project and project logs).
  • Verify span data is complete when span.end() is called; data added after ending won't trigger scoring (see the sketch below).
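A minimal sketch of this pitfall with the Python SDK (project and span names are illustrative):

```python
import braintrust

logger = braintrust.init_logger(project="my-project")  # illustrative project name

# Correct: everything is logged before end(), so online scoring sees it.
span = logger.start_span(name="generate_answer")
span.log(input="What is the refund policy?", output="Refunds within 30 days.")
span.end()

# Anti-pattern: the output is logged after end(), so it won't be part of
# the data that online scoring evaluates.
late = logger.start_span(name="generate_answer")
late.end()
late.log(output="Too late for online scoring.")
```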

Next steps