Scorers evaluate AI outputs by assigning scores between 0 and 100%. Use pre-built scorers from autoevals, write custom code scorers, or build LLM-as-a-judge scorers to measure what matters for your application.

Scorer types

Braintrust offers three types of scorers:
  • Autoevals: Pre-built, battle-tested scorers for common tasks like factuality checking, semantic similarity, and format validation. Start here for standard evaluation needs.
  • LLM-as-a-judge: Use a language model to evaluate outputs based on natural language criteria. Best for subjective judgments like tone, helpfulness, or creativity that are difficult to encode in code.
  • Custom code: Write custom evaluation logic in TypeScript or Python. Best when you have specific rules, patterns, or calculations to implement. Custom code scorers can evaluate either the final output or the entire execution trace for multi-step workflows.
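All three types can be mixed in a single evaluation. The sketch below is illustrative only: the project name, data, and judging criteria are placeholders, and the LLM judge (built with autoevals' LLMClassifierFromTemplate) assumes an OpenAI key is configured.
import { Eval } from "braintrust";
import { Levenshtein, LLMClassifierFromTemplate } from "autoevals";

// LLM-as-a-judge built from a prompt template (criteria are illustrative).
const Politeness = LLMClassifierFromTemplate({
  name: "Politeness",
  promptTemplate:
    "Is the following response polite?\n\n{{output}}\n\na) Yes\nb) No",
  choiceScores: { a: 1, b: 0 },
  useCoT: true,
});

Eval("My Project", {
  data: () => [{ input: "Say hi", expected: "Hello!" }],
  task: async (input) => "Hello!",
  scores: [
    Levenshtein, // pre-built autoeval
    Politeness, // LLM-as-a-judge
    // custom code scorer defined inline
    ({ output, expected }) => ({
      name: "ExactMatch",
      score: output === expected ? 1 : 0,
    }),
  ],
});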

Create scorers

Autoevals

Pre-built, battle-tested scorers for common evaluation tasks. Autoevals are open-source, deterministic (where possible), and optimized for speed and reliability.
Import autoevals and use them directly in evaluations:
import { Eval } from "braintrust";
import { Factuality, Levenshtein, EmbeddingSimilarity } from "autoevals";

Eval("My Project", {
  data: myDataset,
  task: myTask,
  scores: [Factuality, Levenshtein, EmbeddingSimilarity],
});
Autoevals automatically receive these parameters when used in evaluations:
  • input: The input to your task
  • output: The output from your task
  • expected: The expected output (optional)
  • metadata: Custom metadata from the test case
Available scorers:
  • Factuality: Check whether the output is factually consistent with the expected answer
  • EmbeddingSimilarity: Measure semantic similarity to the expected output
  • Levenshtein: Calculate edit distance from expected output
  • ValidJSON: Validate that the output is well-formed JSON (optionally against a schema)
  • Sql: Check whether a SQL query is semantically equivalent to the expected query
See the autoevals library for the complete list.
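Autoevals can also be called directly outside of an evaluation, which is useful for quick local checks. A minimal sketch using the deterministic Levenshtein scorer (the strings are placeholders):
import { Levenshtein } from "autoevals";

// Each autoeval is an async function that accepts the parameters listed
// above and resolves to a score object with a name and a 0-1 score.
const result = await Levenshtein({
  output: "Hello, world!",
  expected: "Hello world",
});

console.log(result.name, result.score); // "Levenshtein", a value close to 1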

LLM-as-a-judge

Use a language model to evaluate outputs based on natural language criteria. The model rates outputs and maps its choices to numeric scores.
Define LLM-as-a-judge scorers in code and push to Braintrust:
scorer.ts
import braintrust from "braintrust";

const project = braintrust.projects.create({ name: "my-project" });

project.scorers.create({
  name: "Helpfulness scorer",
  slug: "helpfulness-scorer",
  description: "Evaluate helpfulness of response",
  messages: [
    {
      role: "user",
      content:
        'Rate the helpfulness of this response: {{output}}\n\nReturn "A" for very helpful, "B" for somewhat helpful, "C" for not helpful.',
    },
  ],
  model: "gpt-4o",
  useCot: true,
  choiceScores: {
    A: 1,
    B: 0.5,
    C: 0,
  },
  metadata: {
    __pass_threshold: 0.7,
  },
});
Push to Braintrust (for a Python scorer, run braintrust push scorer.py instead):
npx braintrust push scorer.ts
Your prompt template can reference these variables:
  • {{input}}: The input to your task
  • {{output}}: The output from your task
  • {{expected}}: The expected output (optional)
  • {{metadata}}: Custom metadata from the test case
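For example, a judge that grades the output against both the input question and the expected answer could use a template like the following sketch. The name, slug, and grading criteria are illustrative, and it assumes the same project handle as the example above:
project.scorers.create({
  name: "Answer correctness",
  slug: "answer-correctness",
  description: "Judge the output against the expected answer",
  messages: [
    {
      role: "user",
      content:
        "Question: {{input}}\n" +
        "Submitted answer: {{output}}\n" +
        "Expected answer: {{expected}}\n\n" +
        'Return "A" if the answers agree, "B" if they partially agree, "C" if they disagree.',
    },
  ],
  model: "gpt-4o",
  useCot: true,
  choiceScores: { A: 1, B: 0.5, C: 0 },
});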

Custom code

Write custom evaluation logic in TypeScript or Python to evaluate outputs. Custom code scorers give you full control over the evaluation logic and can use any packages you need.
output-scorer.ts
import braintrust from "braintrust";
import { z } from "zod";

const project = braintrust.projects.create({ name: "my-project" });

project.scorers.create({
  name: "Equality scorer",
  slug: "equality-scorer",
  description: "Check if output equals expected",
  parameters: z.object({
    output: z.string(),
    expected: z.string(),
  }),
  handler: async ({ output, expected }) => {
    return output === expected ? 1 : 0;
  },
  metadata: {
    __pass_threshold: 0.5,
  },
});
Push to Braintrust (for a Python scorer, run braintrust push output-scorer.py instead):
npx braintrust push output-scorer.ts
Your handler function receives these parameters:
  • input: The input to your task
  • output: The output from your task
  • expected: The expected output (optional)
  • metadata: Custom metadata from the test case
Return a number between 0 and 1, or an object with score and optional metadata:
// Simple return
return 0.85;

// With metadata
return {
  score: 0.85,
  metadata: { reason: "Good factuality, minor tone issues" },
};
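Putting the two together, a handler can combine output and metadata and return the object form. The sketch below assumes the same project and z imports as the earlier example; the requiredKeywords metadata field is hypothetical:
project.scorers.create({
  name: "Keyword coverage",
  slug: "keyword-coverage",
  description: "Fraction of required keywords present in the output",
  parameters: z.object({
    output: z.string(),
    metadata: z.object({ requiredKeywords: z.array(z.string()) }),
  }),
  handler: async ({ output, metadata }) => {
    // Count required keywords that appear in the output (case-insensitive).
    const matched = metadata.requiredKeywords.filter((kw) =>
      output.toLowerCase().includes(kw.toLowerCase()),
    );
    return {
      score: matched.length / Math.max(metadata.requiredKeywords.length, 1),
      metadata: { matched },
    };
  },
});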
Important notes for Python scorers:
  • Scorers must be pushed from within their directory (e.g., braintrust push scorer.py); pushing with relative paths (e.g., braintrust push path/to/scorer.py) is unsupported and will cause import errors.
  • Scorers using local imports must be defined at the project root.
  • Braintrust uses uv to cross-bundle dependencies to Linux. This works for most binary dependencies, but not for libraries that require on-demand compilation.
In TypeScript, Braintrust uses esbuild to bundle your code and dependencies. This works for most dependencies but does not support native (compiled) libraries like SQLite. If you have trouble bundling dependencies, file an issue in the braintrust-sdk repo.
Python scorers created via the CLI have these default packages:
  • autoevals
  • braintrust
  • openai
  • pydantic
  • requests
For additional packages, use the --requirements flag. For scorers with external dependencies:
scorer-with-deps.py
import braintrust
from langdetect import detect  # External package
from pydantic import BaseModel

project = braintrust.projects.create(name="my-project")

class LanguageMatchParams(BaseModel):
    output: str
    expected: str

@project.scorers.create(
    name="Language match",
    slug="language-match",
    description="Check if output and expected are same language",
    parameters=LanguageMatchParams,
    metadata={"__pass_threshold": 0.5},
)
def language_match_scorer(output: str, expected: str):
    return 1.0 if detect(output) == detect(expected) else 0.0
Create a requirements file (requirements.txt):
langdetect==1.0.9
Push with requirements:
braintrust push scorer-with-deps.py --requirements requirements.txt

Set pass thresholds

Define minimum acceptable scores to automatically mark results as passing or failing. When configured, scores that meet or exceed the threshold are marked as passing (green highlighting with checkmark), while scores below are marked as failing (red highlighting).
Add __pass_threshold to the scorer’s metadata (value between 0 and 1):
metadata: {
  __pass_threshold: 0.7,  // Scores below 0.7 are considered failures
}
Example with a custom code scorer:
project.scorers.create({
  name: "Quality checker",
  slug: "quality-checker",
  handler: async ({ output, expected }) => {
    return output === expected ? 1 : 0;
  },
  metadata: {
    __pass_threshold: 0.8,
  },
});

Test scorers

Scorers need to be developed iteratively against real data. When creating or editing a scorer in the UI, use the Run section to test your scorer with data from different sources. Each variable source populates the scorer’s input parameters (like input, output, expected, metadata) from a different location.

Test with manual input

Best for initial development when you have a specific example in mind. Use this to quickly prototype and verify basic scorer logic before testing on larger datasets.
  1. Select Editor in the Run section.
  2. Enter values for input, output, expected, and metadata fields.
  3. Click Test to see how your scorer evaluates the example.
  4. Iterate on your scorer logic based on the results.

Test with a dataset

Best for testing specific scenarios, edge cases, or regression testing. Use this when you want controlled, repeatable test cases or need to ensure your scorer handles specific situations correctly.
  1. Select Dataset in the Run section.
  2. Choose a dataset from your project.
  3. Select a record to test with.
  4. Click Test to see how your scorer evaluates the example.
  5. Review results to identify patterns and edge cases.

Test with logs

Best for testing against actual usage patterns and debugging real-world edge cases. Use this when you want to see how your scorer performs on data your system is actually generating.
  1. Select Logs in the Run section.
  2. Select the project containing the logs you want to test against.
  3. Filter logs to find relevant examples:
    • Click Add filter and choose just root spans, specific span names, or a more advanced filter based on specific input, output, metadata, or other values.
    • Select a timeframe.
  4. Click Test to see how your scorer evaluates real production data.
  5. Identify cases where the scorer needs adjustment for real-world scenarios.
To create a new online scoring rule with the filters automatically prepopulated from your current log filters, click Online scoring. This enables rapid iteration from logs to scoring rules. See Create scoring rules for more details.

Scorer permissions

Both LLM-as-a-judge scorers and custom code scorers automatically receive a BRAINTRUST_API_KEY environment variable that allows them to:
  • Make LLM calls using organization and project AI secrets
  • Access attachments from the current project
  • Read and write logs to the current project
  • Read prompts from the organization
For custom code scorers that need expanded permissions beyond the current project (such as logging to other projects, reading datasets, or accessing other organization data), you can provide your own API key using the PUT /v1/env_var endpoint.
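For example, a custom code scorer can point the OpenAI client at the Braintrust AI proxy and authenticate with the injected key to make its own LLM call. This is a sketch, not the only pattern: the model, prompt, and scorer name are placeholders, and it assumes the openai package is bundled with the scorer and the same project and z setup as the earlier examples.
import OpenAI from "openai";

project.scorers.create({
  name: "Conciseness judge",
  slug: "conciseness-judge",
  description: "Ask an LLM to rate conciseness from 0 to 1",
  parameters: z.object({ output: z.string() }),
  handler: async ({ output }) => {
    // BRAINTRUST_API_KEY is injected automatically when the scorer runs.
    const client = new OpenAI({
      baseURL: "https://api.braintrust.dev/v1/proxy",
      apiKey: process.env.BRAINTRUST_API_KEY,
    });
    const completion = await client.chat.completions.create({
      model: "gpt-4o-mini",
      messages: [
        {
          role: "user",
          content: `Rate the conciseness of this text from 0 to 1. Return only the number.\n\n${output}`,
        },
      ],
    });
    // Fall back to 0 if the model does not return a parseable number.
    return Number(completion.choices[0].message.content) || 0;
  },
});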

Optimize with Loop

Generate and improve scorers using Loop. Example queries:
  • “Write an LLM-as-a-judge scorer for a chatbot that answers product questions”
  • “Generate a code-based scorer based on project logs”
  • “Optimize the Helpfulness scorer”
  • “Adjust the scorer to be more lenient”
Loop can also tune scorers based on manual labels from the playground.

Best practices

  • Start with autoevals: Use pre-built scorers when they fit your needs. They’re well-tested and reliable.
  • Be specific: Define clear evaluation criteria in your scorer prompts or code.
  • Use multiple scorers: Measure different aspects (factuality, helpfulness, tone) with separate scorers.
  • Choose the right scope: Use trace scorers (custom code with a trace parameter) for multi-step workflows and agents. Use output scorers for simple quality checks.
  • Test scorers: Run scorers on known examples to verify they behave as expected.
  • Version scorers: Like prompts, scorers are versioned automatically. Track what works.
  • Balance cost and quality: LLM-as-a-judge scorers are more flexible but cost more and take longer than custom code scorers.

Next steps