Scorers evaluate AI outputs by assigning scores between 0 and 1 (displayed as 0–100% in the UI). Use pre-built scorers from autoevals, create custom code-based scorers, or build LLM-as-a-judge scorers to measure what matters for your application.

Use autoevals

The autoevals library provides pre-built scorers for common evaluation tasks:
```typescript
import { Factuality, Levenshtein, Semantic } from "autoevals";
```

```python
from autoevals import Factuality, Levenshtein, Semantic
```
Popular autoevals scorers:
  • Factuality: Check whether the output is factually consistent with the expected answer
  • Semantic: Measure semantic similarity to expected output
  • Levenshtein: Calculate edit distance from expected output
  • JSON: Validate JSON structure and content
  • SQL: Validate SQL query syntax and semantics
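To build intuition for what these scorers compute, here is a minimal pure-Python sketch of a normalized edit-distance score in the spirit of the Levenshtein scorer. This is an illustration of the idea, not autoevals' implementation:

```python
def levenshtein_score(output: str, expected: str) -> float:
    """Return 1 - (edit distance / max length), so identical strings score 1.0."""
    if not output and not expected:
        return 1.0
    m, n = len(output), len(expected)
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if output[i - 1] == expected[j - 1] else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        prev = curr
    return 1 - prev[n] / max(m, n)
```

For example, `levenshtein_score("kitten", "sitting")` has edit distance 3 over a maximum length of 7, scoring roughly 0.57.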
Use autoevals directly in evaluations:
```typescript
import { Eval } from "braintrust";
import { Factuality } from "autoevals";

Eval("My Project", {
  data: myDataset,
  task: myTask,
  scores: [Factuality],
});
```

Create custom scorers

For specialized evaluation, create custom scorers in TypeScript, Python, or as LLM-as-a-judge.
Security: On Braintrust-hosted deployments and self-hosted deployments on AWS, scorers run in isolated AWS Lambda environments within a dedicated VPC that has no access to internal infrastructure. See code execution security for details.
Navigate to Scorers > + Scorer to create scorers in the UI.

Code-based scorers

Write TypeScript or Python code that evaluates outputs.
UI scorers have access to these packages:
  • anthropic
  • autoevals
  • braintrust
  • json
  • math
  • openai
  • re
  • requests
  • typing
For additional packages, use the SDK method below.
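As a sketch of what a code-based scorer's logic can look like, here is a hypothetical scorer using only the json package from the list above. The function name, signature, and criteria are illustrative; the exact entry-point signature expected by the UI may differ:

```python
import json

def handler(output, expected=None, **kwargs):
    """Hypothetical code-based scorer: checks that the output parses as JSON
    and gives partial credit for each expected top-level key it contains."""
    try:
        parsed = json.loads(output)
    except (TypeError, ValueError):
        return 0  # not valid JSON at all
    if not expected:
        return 1  # nothing specific to compare against
    required = set(json.loads(expected) if isinstance(expected, str) else expected)
    present = required & set(parsed)
    # Partial credit: fraction of required keys present in the output.
    return len(present) / len(required)
```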

LLM-as-a-judge scorers

Define prompts that evaluate outputs and map model choices to scores. Configure:
  • Prompt: Instructions for evaluating the output
  • Model: Which model to use as judge
  • Choice scores: Map model choices (A, B, C) to numeric scores
  • Use CoT: Enable chain-of-thought reasoning for complex evaluations
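Conceptually, the judge model picks a choice and the choice-score map converts it to a number. A minimal sketch of that mapping step, with hypothetical choices and values:

```python
# Hypothetical choice-score map: the judge answers A, B, or C.
CHOICE_SCORES = {"A": 1.0, "B": 0.5, "C": 0.0}

def score_from_choice(choice: str) -> float:
    """Map the judge's letter choice to a numeric score in [0, 1]."""
    normalized = choice.strip().upper()
    if normalized not in CHOICE_SCORES:
        raise ValueError(f"Unexpected judge choice: {choice!r}")
    return CHOICE_SCORES[normalized]
```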

Scorer parameters

Scorers receive these parameters:
  • input: The input to your task
  • output: The output from your task
  • expected: The expected output (optional)
  • metadata: Custom metadata from the test case
Return a number between 0 and 1, or an object with score and optional metadata.
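Putting the parameters and return shape together, a custom scorer might look like the sketch below. The length criterion and the max_words metadata field are placeholders, not a real Braintrust convention:

```python
def conciseness_scorer(input, output, expected=None, metadata=None):
    """Hypothetical scorer: rewards outputs at or under a target word count.

    Returns an object-style result with a score in [0, 1] plus metadata."""
    target = (metadata or {}).get("max_words", 50)  # hypothetical metadata field
    words = len(output.split())
    score = 1.0 if words <= target else max(0.0, 1 - (words - target) / target)
    return {
        "name": "Conciseness",
        "score": score,
        "metadata": {"word_count": words, "target": target},
    }
```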

Scorer permissions

Both LLM-as-a-judge scorers and code-based scorers automatically receive a BRAINTRUST_API_KEY environment variable that allows them to:
  • Make LLM calls using organization and project AI secrets
  • Access attachments from the current project
  • Read and write logs to the current project
  • Read prompts from the organization
For code-based scorers that need expanded permissions beyond the current project (such as logging to other projects, reading datasets, or accessing other organization data), you can provide your own API key using the PUT /v1/env_var endpoint.

Set pass thresholds

Define minimum acceptable scores using __pass_threshold in metadata (value between 0 and 1):
```typescript
metadata: {
  __pass_threshold: 0.7, // Scores below 0.7 are considered failures
}
```
You can also set pass thresholds when creating or editing scorers in the UI using the threshold slider. When a scorer has a pass threshold configured:
  • Scores that meet or exceed the threshold are marked as passing and displayed with green highlighting and a checkmark
  • Scores below the threshold are marked as failing and displayed with red highlighting
This visual feedback makes it easy to scan evaluation results and identify which outputs meet your quality criteria at a glance.
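The pass/fail determination itself is simple threshold logic; a sketch of how scores map to the passing state described above:

```python
PASS_THRESHOLD = 0.7  # mirrors the __pass_threshold metadata value

def annotate(scores):
    """Mark each score as passing when it meets or exceeds the threshold."""
    return [{"score": s, "passed": s >= PASS_THRESHOLD} for s in scores]
```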

Test scorers

Scorers need to be developed iteratively against real data. When creating or editing a scorer in the UI, use the Run section to test your scorer with data from different sources. Each variable source populates the scorer’s input parameters (like input, output, expected, metadata) from a different location.

Test with manual input

Best for initial development when you have a specific example in mind. Use this to quickly prototype and verify basic scorer logic before testing on larger datasets.
  1. Select Editor in the Run section.
  2. Enter values for input, output, expected, and metadata fields.
  3. Click Test to see how your scorer evaluates the example.
  4. Iterate on your scorer logic based on the results.

Test with a dataset

Best for testing specific scenarios, edge cases, or regression testing. Use this when you want controlled, repeatable test cases or need to ensure your scorer handles specific situations correctly.
  1. Select Dataset in the Run section.
  2. Choose a dataset from your project.
  3. Select a record to test with.
  4. Click Test to see how your scorer evaluates the example.
  5. Review results to identify patterns and edge cases.

Test with logs

Best for testing against actual usage patterns and debugging real-world edge cases. Use this when you want to see how your scorer performs on data your system is actually generating.
  1. Select Logs in the Run section.
  2. Select the project containing the logs you want to test against.
  3. Filter logs to find relevant examples:
    • Click Add filter and choose just root spans, specific span names, or a more advanced filter based on specific input, output, metadata, or other values.
    • Select a timeframe.
  4. Click Test to see how your scorer evaluates real production data.
  5. Identify cases where the scorer needs adjustment for real-world scenarios.
To create a new online scoring rule with the filters automatically prepopulated from your current log filters, click Online scoring. This enables rapid iteration from logs to scoring rules. See Create scoring rules for more details.

Optimize with Loop

Generate and improve scorers using Loop. Example queries:
  • “Write an LLM-as-a-judge scorer for a chatbot that answers product questions”
  • “Generate a code-based scorer based on project logs”
  • “Optimize the Helpfulness scorer”
  • “Adjust the scorer to be more lenient”
Loop can also tune scorers based on manual labels from the playground.

Best practices

  • Start with autoevals: Use pre-built scorers when they fit your needs. They're well-tested and reliable.
  • Be specific: Define clear evaluation criteria in your scorer prompts or code.
  • Use multiple scorers: Measure different aspects (factuality, helpfulness, tone) with separate scorers.
  • Test scorers: Run scorers on known examples to verify they behave as expected.
  • Version scorers: Like prompts, scorers are versioned automatically. Track what works.
  • Balance cost and quality: LLM-as-a-judge scorers are more flexible but cost more and take longer than code-based scorers.

Next steps