Scoring logs in Braintrust lets you evaluate the quality of your AI application in real time through online evaluations. Unlike experiments, which evaluate pre-defined datasets, online scoring automatically runs evaluations on your production logs as they are generated, providing continuous monitoring and quality assurance. To score historical logs on demand instead of automatically as they are generated, see Manual scoring below.

What is online scoring?

Online scoring, also known as online evaluations, runs evaluations on traces as they are logged in production. This enables you to:
  • Monitor quality continuously without manually running evaluations
  • Catch regressions in production performance immediately
  • Evaluate at scale across all your production traffic
  • Get insights into real user interactions and edge cases
Online scoring runs asynchronously in the background, so it doesn’t impact your application’s latency or performance.
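Because online scoring consumes the logs your application already produces, no scoring-specific code is needed. As a minimal sketch with the Python SDK (the project name is hypothetical), any span logged like this becomes eligible for the scoring rules described below:

```python
import braintrust
from braintrust import traced

braintrust.init_logger(project="my-project")  # hypothetical project name

@traced  # each call logs a span that online scoring rules can pick up
def answer(question: str) -> str:
    # Call your model here; @traced captures the function's
    # arguments and return value as the span's input and output.
    return "..."
```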

Set up online scoring

Online scoring is configured at the project level through the Configuration page. You can set up multiple scoring rules with different sampling rates and filters.

Create online scoring rules

Navigate to Configuration > Online scoring > + Create rule.

Configure online scoring rule parameters

For each online scoring rule, you can configure:
  • Rule name: Unique identifier for the rule
  • Description: Explanation of what the rule does and why it exists
  • Scorers: Choose from autoevals scorers or any custom scorers defined in the current project or in another project
  • Sampling rate: Percentage of logs to evaluate (for example, 10% for high-volume applications)
  • BTQL filter clause: Filter spans based on their data (input, output, metadata, etc.) using a BTQL filter clause. Only spans where the clause evaluates to true will be considered for scoring; see the example clauses after this list.
The != operator is not supported in this specific context (it fails silently). Use IS NOT instead.
  • Apply to spans: Among spans that pass the BTQL filter (if any), choose which span types to actually score:
    • Root spans toggle: Score root spans (top-level spans with no parent)
    • Span names field: Score spans with specific names (comma-separated list)
    • Note: If both options are enabled, spans matching EITHER criterion will be scored. If neither option is enabled, ALL spans that pass the BTQL filter will be scored.
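For illustration, a rule intended to score only completed spans from production traffic might use a clause like one of the following. These are sketches: metadata.environment is a hypothetical field, so match field names to your own log structure.

```
output IS NOT NULL
metadata.environment = 'production' AND output IS NOT NULL
```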

Test online scoring rules

Preview how your online scoring rule will perform by selecting Test rule at the bottom of the configuration dialog. This allows you to see sample results before enabling the rule.

Types of scorers for online evaluation

Online scoring supports the same types of scorers you can use in experiments. You can use pre-built scorers from the autoevals library or create custom code-based scorers written in TypeScript or Python that implement your specific evaluation logic. For more information on creating scorers, check out the scorers guide.
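As a sketch of what a custom code-based scorer can look like, the hypothetical Python function below returns 1 when a span's output parses as JSON and 0 otherwise; adapt the signature and return convention to the scorers guide:

```python
import json

# Hypothetical code-based scorer: checks whether the span's output is valid JSON.
# The input/output/expected keyword arguments follow the common convention for
# code-based scorers; the return value is a score between 0 and 1.
def valid_json(input, output, expected=None, **kwargs):
    try:
        json.loads(output)
        return 1
    except (TypeError, ValueError):
        return 0
```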

View online scoring results

In logs view

Online scoring results appear automatically in your logs. Each scored span will show:
  • Score value: The numerical result (0-1 or 0-100 depending on scorer)
  • Scoring span: A child span containing the evaluation details
  • Scorer name: Which scorer generated the result

Best practices for online scoring

Choose your sampling rate based on application volume and criticality. High-volume applications should use lower sampling rates (1-10%) to manage costs, while low-volume or critical applications can afford higher rates (50-100%) for comprehensive coverage. Since online scoring runs asynchronously, it won’t impact your application’s latency, though LLM-as-a-judge scorers may have higher latency and costs than code-based alternatives. Online scoring works best as a complement to offline experimentation, helping you validate experiment results in production, monitor deployed changes for quality regressions, and identify new test cases from real user interactions.

Troubleshooting common issues

Low or inconsistent scores

  • Review scorer logic: Ensure scoring criteria match expectations
  • Check input/output format: Verify scorers receive expected data structure
  • Test with known examples: Validate scorer behavior on controlled inputs
  • Refine evaluation prompts: Make LLM-as-a-judge criteria more specific

Missing scores

  • Check Apply to spans settings in online scoring rule: Ensure the root spans toggle and/or span names field target the correct span types
  • Check BTQL filter clause in online scoring rule: Confirm your logs’ data (input, output, metadata) passes the BTQL filter clause. Also see the note above about the unsupported != operator
  • Check sampling rate in online scoring rule: Low sampling may result in sparse scoring
  • Check token permissions: Ensure your API key or service token has access to scorer projects and has ‘Read’ and ‘Update’ permissions on the project and project logs
  • Check span data completeness at end time (see the sketch after this list):
    • Online scoring currently triggers when span.end() is called (or automatically when using wrapTraced() in TypeScript or the @traced decorator in Python)
    • The online scoring rule’s BTQL filter clause evaluates only the data present at the moment span.end() is called
    • If a span is updated after calling end() (e.g., logging output after ending), the update won’t be seen by the BTQL filter clause. For example, if your filter requires output IS NOT NULL but output is logged after span.end(), the span won’t be scored
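A minimal Python sketch of this pitfall and its fix (project name hypothetical):

```python
import braintrust

logger = braintrust.init_logger(project="my-project")  # hypothetical project name

# Pitfall: output logged after end() is invisible to the rule's BTQL filter.
span = logger.start_span(name="llm-call")
span.log(input={"prompt": "..."})
span.end()              # online scoring evaluates the span here
span.log(output="...")  # too late: a filter requiring `output IS NOT NULL` won't match

# Fix: log every field the filter depends on before ending the span.
span = logger.start_span(name="llm-call")
span.log(input={"prompt": "..."}, output="...")
span.end()              # the filter now sees the output
```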

Manual scoring

You can manually apply scorers to historical logs. When applied, scores will show up as additional spans within the log’s trace. There are three ways to manually score logs:
  • Specific logs: Select the logs you’d like to score, then select Score to apply the chosen scorers
  • Individual logs: Navigate into any individual log and use the Score button in the trace view to apply scorers to that specific log
  • Bulk filtered logs: Use filters to narrow your view, then select Score past logs under Online scoring to apply scorers to the 50 most recent logs matching your filters