Evaluate output quality with scorers and classifiers

Scorers evaluate AI output quality by assigning scores between 0 and 1 based on criteria you define, like factual accuracy, helpfulness, or correct formatting. Classifiers categorize AI output by assigning a categorical label instead of a numeric score, along a dimension you define, like intent, sentiment, or topic. Run scorers and classifiers on experiment test cases, or apply them continuously with online scoring on production traces. Results show up alongside your experiments and logs, so you can monitor quality over time, filter by category, and add the most useful examples back into your datasets.

Scorers

Scorers return a numeric score between 0 and 1, measuring qualities like factual accuracy, helpfulness, or correct formatting. Use a scorer to:

Track quality over time.
Compute average quality scores.
Set pass/fail thresholds.
Rank and compare outputs.

A scorer receives the input, output, expected, metadata, and trace for each result, and returns a number between 0 and 1 (optionally with a name and metadata). There are three types of scorers:

Autoevals: Pre-built, battle-tested scorers for common evaluation tasks like factuality checking, semantic similarity, and format validation. Best for standard evaluation needs where reliable scorers already exist.
LLM-as-a-judge: Use language models to evaluate outputs based on natural language criteria and instructions. Best for subjective judgments like tone, helpfulness, or creativity that are difficult to encode in deterministic code.
Custom code: Write custom evaluation logic with full control over the scoring algorithm. Best for specific business rules, pattern matching, or calculations unique to your use case.
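As a rough illustration of the custom code option, the sketch below models a scorer as a plain Python function that takes the arguments described above and returns a score between 0 and 1 with an optional name and metadata. The function names (exact_match_scorer, passes) and the dictionary return shape are assumptions made for this example, not the SDK's exact types.

```python
from typing import Any, Optional


def exact_match_scorer(
    input: Any,
    output: Any,
    expected: Any = None,
    metadata: Optional[dict] = None,
    trace: Any = None,
) -> dict:
    """Hypothetical custom code scorer: 1.0 when output matches expected, else 0.0."""
    # Normalize both sides so whitespace and letter case do not affect the result.
    normalized_output = str(output).strip().lower()
    normalized_expected = str(expected).strip().lower()
    score = 1.0 if normalized_output == normalized_expected else 0.0
    # Return the score with an optional name and metadata (assumed dict shape).
    return {
        "name": "exact_match",
        "score": score,
        "metadata": {"normalized_output": normalized_output},
    }


def passes(result: dict, threshold: float = 0.5) -> bool:
    """Apply a pass/fail threshold to a scorer result."""
    return result["score"] >= threshold


result = exact_match_scorer(
    input="What is the capital of France?",
    output=" Paris ",
    expected="paris",
)
print(result["score"], passes(result))  # prints: 1.0 True
```

Keeping the score in the 0 to 1 range is what makes averaging, thresholds, and ranking comparable across scorers.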

Classifiers

Classifiers receive the same arguments as a scorer, but they return a classification (a categorical label) instead of a numeric score. A label groups outputs into categories without ranking them, so unlike a score it implies no relative quality. Use a classifier to:

Categorize outputs by a dimension like sentiment, intent, or issue type.
Filter results by category.
Build evaluation datasets from the outputs in a category.

A classification has a name (the group it belongs to, such as intent), an id (the value within that group, such as password_reset), an optional label for display (such as Password reset), and optional metadata.

In the experiments and logs tables and in playground results, each classifier appears as its own column, named with a classifications. prefix, which you can sort and filter. In experiments and playgrounds, you can also group rows by a classifier column. See Interpret evaluation results for details.

There are two types of classifiers:

LLM-as-a-judge: Use a language model to choose a label from a fixed set based on natural language criteria. Best for subjective or semantic categories like intent, sentiment, or topic, where the right label depends on understanding the content.
Custom code: Write logic that returns a label directly. Best for deterministic categories you can derive from explicit rules, patterns, or computed values.
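A custom code classifier can be sketched the same way as a scorer, except that it returns a label instead of a number. The example below is a minimal illustration only: the keyword rules, the INTENT_LABELS table, and the dictionary return shape are assumptions for this sketch, while the name, id, label, and metadata fields follow the classification fields described above.

```python
from typing import Any, Optional

# Hypothetical fixed set of categories: id -> display label.
INTENT_LABELS = {
    "password_reset": "Password reset",
    "billing": "Billing",
    "other": "Other",
}


def intent_classifier(
    input: Any,
    output: Any = None,
    expected: Any = None,
    metadata: Optional[dict] = None,
    trace: Any = None,
) -> dict:
    """Hypothetical rule-based classifier that labels the user's intent."""
    text = str(input).lower()
    # Deterministic keyword rules; the first matching rule decides the category.
    if "password" in text:
        intent_id = "password_reset"
    elif "invoice" in text or "refund" in text:
        intent_id = "billing"
    else:
        intent_id = "other"
    return {
        "name": "intent",                    # the group the classification belongs to
        "id": intent_id,                     # the value within that group
        "label": INTENT_LABELS[intent_id],   # optional display label
        "metadata": {"rule": "keyword"},     # optional metadata
    }


print(intent_classifier(input="I forgot my password")["id"])  # prints: password_reset
```

The labels carry no ordering, so results like these are meant for filtering and grouping rather than averaging.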

Classifiers require TypeScript SDK v3.9.0+, Python SDK v0.16.0+, Go SDK v0.8.0+, Java SDK v0.3.12+, Ruby SDK v0.4.0+, or C# SDK v0.2.8+. On self-hosted deployments, classifiers require data plane v2.0 or later.

To classify logs automatically instead of defining your own classifier, use Topics, which generates classifications from your production traffic.

Where to define scorers and classifiers

You can define scorers in three places, and classifiers in two of them (inline in your eval or in the UI):

Inline in SDK code: Define scorers directly in your evaluation scripts. Best for local development, access to complex dependencies, or application-specific logic tightly coupled to your codebase.
Pushed via CLI: Define TypeScript or Python scorers in code files and push them to Braintrust. Best for version control in Git, team-wide sharing across projects, and automatic evaluation of production logs.
Created in UI: Build TypeScript or Python scorers in the Braintrust web interface. Best for rapid prototyping and simple LLM-as-a-judge scorers and classifiers.

Most teams prototype in the UI, develop complex scorers inline, then push production-ready scorers to Braintrust for team-wide use.

In the SDKs, scorers and classifiers are distinct function types, with a dedicated parameter for classifiers separate from scorers. In the UI, you create a classifier by setting a scorer’s Output type to Classification.

Scorer and classifier scopes

Scorers and classifiers run at one of two scopes:

Span: Runs on an individual span, such as a single LLM response or tool call. Each matching span is evaluated independently.
Trace: Runs once on an entire trace after it completes, with access to all of its spans.

Whether a scorer or classifier works at trace level depends on how it’s built. An LLM-as-a-judge prompt that uses thread variables such as {{thread}}, or custom code that reads the trace argument, evaluates the full trace. Autoevals can’t access the trace, so they always run at span scope. For online scoring, set the rule’s Scope field to Trace to run on the full trace, or Span to run on individual spans.

Test scorers and classifiers

Scorers and classifiers need to be developed iteratively against real data. When creating or editing one in the UI, use the Run section to test it with data from different sources. Each source populates the scorer’s input parameters (like input, output, expected, metadata) from a different location.

Test with manual input

Best for initial development when you have a specific example in mind. Use this to quickly prototype and verify basic scorer or classifier logic before testing on larger datasets.

Click Test with [current source] in the Run section toolbar and choose Editor.
Enter values for input, output, expected, and metadata fields.
Click Test to see how your scorer evaluates the example.
Iterate on your scorer logic based on the results.

Test with a dataset

Best for testing specific scenarios, edge cases, or regression testing. Use this when you want controlled, repeatable test cases or need to ensure your scorer handles specific situations correctly.

Click Test with [current source] in the Run section toolbar and choose Dataset.
Choose a dataset from your project.
Select a record to test with.
Click Test to see how your scorer evaluates the example.
Review results to identify patterns and edge cases.

You can test directly with a standard dataset row ({input, expected, metadata}). Braintrust transforms the row into the {input, output, expected, metadata} shape that a scorer receives when it runs against data from logs, experiments, and playgrounds. Datasets don’t have a top-level output field, so in this UI input.output is hoisted and is accessible in your scorer as output. Other dataset keywords are hoisted the same way: input.expected, if defined, overwrites the top-level expected field. When the transformed row differs from the raw row, the panel shows a side-by-side Raw dataset row and What the scorer receives view.
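The hoisting described above can be sketched as a small transformation. This is an illustration of the documented behavior, not Braintrust's implementation: the function name to_scorer_args is hypothetical, "defined" is assumed to mean present and not None, and leaving the hoisted keys in input is also an assumption, since the text does not say whether they are removed.

```python
from typing import Any


def to_scorer_args(row: dict) -> dict:
    """Hypothetical sketch: map a raw dataset row to the arguments a scorer receives."""
    raw_input: Any = row.get("input")
    args = {
        "input": raw_input,  # assumption: input is passed through unchanged
        "output": None,      # dataset rows have no top-level output field
        "expected": row.get("expected"),
        "metadata": row.get("metadata"),
    }
    if isinstance(raw_input, dict):
        # input.output is exposed to the scorer as output.
        if raw_input.get("output") is not None:
            args["output"] = raw_input["output"]
        # input.expected, if defined, overwrites the top-level expected field.
        if raw_input.get("expected") is not None:
            args["expected"] = raw_input["expected"]
    return args


raw_row = {
    "input": {"question": "What is 2 + 2?", "output": "4", "expected": "4"},
    "expected": "four",
    "metadata": {"topic": "math"},
}
received = to_scorer_args(raw_row)
print(received["output"], received["expected"])  # prints: 4 4
```

In this example the transformed row differs from the raw row, which is the case where the panel shows the side-by-side view.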

Test with logs

Best for testing against actual usage patterns and debugging real-world edge cases. Use this when you want to see how your scorer or classifier performs on data your system is actually generating.

Click Test with [current source] in the Run section toolbar and choose Logs.
Select the project containing the logs you want to test against.
Filter logs to find relevant examples:
- Click Filter and choose just root spans, specific span names, or a more advanced filter based on specific input, output, metadata, or other values.
- Select a timeframe.
The matching trace appears in the panel. Click into spans to inspect inputs, outputs, and metadata, and use the prev and next controls in the toolbar to step through matching root spans. What the scorer receives depends on its scope:
- For span scorers, the selected span is the one passed to the scorer.
- For trace scorers, the entire trace is passed regardless of which span you select, and input, output, expected, and metadata are populated from the root span.
Click Test to see how your scorer evaluates real production data.
Identify cases where the scorer needs adjustment for real-world scenarios.

To create a new online scoring rule with the filters automatically prepopulated from your current log filters, click Automations. This enables rapid iteration from logs to scoring rules. See Create scoring rules for more details.

Scorer and classifier permissions

Both LLM-as-a-judge and custom code scorers and classifiers automatically receive a BRAINTRUST_API_KEY environment variable that allows them to:

Make LLM calls using organization and project AI secrets
Access attachments from the current project
Read logs from and write logs to the current project
Read prompts from the organization

For custom code scorers and classifiers that need expanded permissions beyond the current project (such as logging to other projects, reading datasets, or accessing other organization data), you can provide your own API key using the PUT /v1/env_var endpoint.
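As a hedged sketch of that last step, the snippet below only builds (and does not send) a PUT request to the /v1/env_var endpoint using the Python standard library. The base URL, the bearer-token header, and the body fields (object_type, object_id, name, value) are assumptions for illustration, as are the placeholder values; check the API reference for the exact request schema and for the variable name your scorer should read.

```python
import json
import urllib.request

# Assumed API base URL; self-hosted deployments would use their own.
API_BASE = "https://api.braintrust.dev"


def build_env_var_request(
    api_key: str, object_type: str, object_id: str, name: str, value: str
) -> urllib.request.Request:
    """Build (without sending) a PUT /v1/env_var request.

    The body field names below are assumptions, not a confirmed schema.
    """
    body = json.dumps(
        {
            "object_type": object_type,  # assumed: the kind of object the variable is scoped to
            "object_id": object_id,      # assumed: the id of that object
            "name": name,                # the environment variable name
            "value": value,              # the secret value, here your own API key
        }
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/v1/env_var",
        data=body,
        method="PUT",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


request = build_env_var_request(
    api_key="<admin-api-key>",            # placeholder
    object_type="project",                # placeholder
    object_id="<project-id>",             # placeholder
    name="<ENV_VAR_NAME>",                # placeholder; the source does not name the variable
    value="<api-key-with-wider-access>",  # placeholder
)
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request)
```

Building the request separately from sending it makes it easy to inspect the payload before any secret leaves your machine.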

Optimize with Loop

Generate and improve scorers and classifiers using Loop. Example queries:

“Write an LLM-as-a-judge scorer for a chatbot that answers product questions”
“Generate a code-based scorer based on project logs”
“Optimize the Helpfulness scorer”
“Adjust the scorer to be more lenient”

Best practices

Start with autoevals: Use pre-built scorers when they fit your needs. They’re well-tested and reliable.
Be specific: Define clear evaluation criteria in your scorer and classifier prompts or code.
Use multiple scorers or classifiers: Measure different aspects (factuality, helpfulness, tone) with separate scorers and classifiers.
Choose the right scope: Use trace scorers and classifiers for multi-step workflows and agents. Use span scorers and classifiers for simple quality checks.
Test scorers: Run scorers and classifiers on known examples to verify they behave as expected.
Version scorers: Like prompts, scorers and classifiers are versioned automatically. Track what works.
Balance cost and quality: LLM-as-a-judge scorers are more flexible but cost more and take longer than custom code scorers.

Create custom table views

The Scorers page supports custom table views to save your preferred filters, column order, and display settings. To create or update a custom table view:

Apply the filters and display settings you want.
Open the menu and select Save view… or Save view as….

Custom table views are visible to all project members. Creating or editing a table view requires the Update project permission.

Set default table views

You can set default views at three levels:

Organization default: Visible to all members when they open the page. This applies per page. For example, you can set separate organization defaults for Logs, Experiments, and Review. To set an organization default, you need the Manage settings organization permission (included by default in the Owner role). See Access control for details.
Project default: Overrides the organization default for everyone viewing this project. To set a project default, you need the project-level Update permission. Project admins can set project defaults even without organization-level permissions. See Access control for details.
Personal default: Overrides the project and organization defaults for you only. Personal defaults are stored in your browser, so they do not carry over across devices or browsers.

To set a default view:

Switch to the view you want by selecting it from the menu.
Open the menu again and hover over the currently selected view to reveal its submenu.
Choose Set as personal default view, Set as project default view, or Set as organization default view.

To clear a default view:

Open the menu and hover over the currently selected view to reveal its submenu.
Choose Clear personal default view, Clear project default view, or Clear organization default view.

Default view settings are mutually exclusive on a given view. Setting one type of default on a view automatically clears any other default that was previously set on the same view. When a user opens a page, Braintrust loads the first match in this order: personal default, project default, organization default, then the standard “All …” view (for example, “All logs view”).
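The lookup order above amounts to a first-match search. The sketch below is only an illustration of that precedence; the function name and the use of None for an unset default are assumptions made for the example.

```python
from typing import Optional


def resolve_default_view(
    personal: Optional[str],
    project: Optional[str],
    organization: Optional[str],
    page_name: str = "logs",
) -> str:
    """Return the view that loads for a page, using the documented precedence."""
    # Check personal, then project, then organization; the first one set wins.
    for candidate in (personal, project, organization):
        if candidate is not None:
            return candidate
    # Nothing is set, so fall back to the standard view for the page.
    return f"All {page_name} view"


print(resolve_default_view(None, "Failing scorers", "Org overview"))  # prints: Failing scorers
print(resolve_default_view(None, None, None))  # prints: All logs view
```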

Next steps

Autoevals: Drop-in pre-built scorers
LLM-as-a-judge: Natural language evaluation criteria
Custom code: Full control over scoring logic
Run evaluations using your scorers
Score production logs with online scoring rules
