Write effective scorers

Scorers evaluate AI outputs by assigning scores between 0 and 100%. Use pre-built scorers from autoevals, create custom code-based scorers, or build LLM-as-a-judge scorers to measure what matters for your application.

Use autoevals

The autoevals library provides pre-built scorers for common evaluation tasks:

import { Factuality, Levenshtein, Semantic } from "autoevals";

from autoevals import Factuality, Levenshtein, Semantic

Popular autoevals scorers:

Factuality: Check if output contains factual information
Semantic: Measure semantic similarity to expected output
Levenshtein: Calculate edit distance from expected output
JSON: Validate JSON structure and content
SQL: Validate SQL query syntax and semantics

Use autoevals directly in evaluations:

import { Eval } from "braintrust";
import { Factuality } from "autoevals";

Eval("My Project", {
  data: myDataset,
  task: myTask,
  scores: [Factuality],
});

Create custom scorers

For specialized evaluation, create custom scorers in TypeScript, Python, or as LLM-as-a-judge.

Navigate to Scorers > + Scorer to create scorers in the UI.

Code-based scorers

Write TypeScript or Python code that evaluates outputs:

UI scorers have access to these packages:

anthropic
autoevals
braintrust
json
math
openai
re
requests
typing

For additional packages, use the SDK method below.

LLM-as-a-judge scorers

Define prompts that evaluate outputs and map choices to scores:

Configure:

Prompt: Instructions for evaluating the output
Model: Which model to use as judge
Choice scores: Map model choices (A, B, C) to numeric scores
Use CoT: Enable chain-of-thought reasoning for complex evaluations

Write scorers in your environment with full package access:

TypeScript scorers

scorer.ts

import braintrust from "braintrust";
import { z } from "zod";

const project = braintrust.projects.create({ name: "my-project" });

// Code-based scorer
project.scorers.create({
  name: "Equality scorer",
  slug: "equality-scorer",
  description: "Check if output equals expected",
  parameters: z.object({
    output: z.string(),
    expected: z.string(),
  }),
  handler: async ({ output, expected }) => {
    return output === expected ? 1 : 0;
  },
  metadata: {
    __pass_threshold: 0.5,
  },
});

// LLM-as-judge scorer
project.scorers.create({
  name: "Helpfulness scorer",
  slug: "helpfulness-scorer",
  description: "Evaluate helpfulness of response",
  messages: [
    {
      role: "user",
      content:
        'Rate the helpfulness of this response: {{output}}\n\nReturn "A" for very helpful, "B" for somewhat helpful, "C" for not helpful.',
    },
  ],
  model: "gpt-4o",
  useCot: true,
  choiceScores: {
    A: 1,
    B: 0.5,
    C: 0,
  },
  metadata: {
    __pass_threshold: 0.7,
  },
});

Python scorers

scorer.py

import braintrust
from pydantic import BaseModel

project = braintrust.projects.create(name="my-project")

# Code-based scorer
class EqualityParams(BaseModel):
    output: str
    expected: str

@project.scorers.create(
    name="Equality scorer",
    slug="equality-scorer",
    description="Check if output equals expected",
    parameters=EqualityParams,
    metadata={"__pass_threshold": 0.5},
)
def equality_scorer(output: str, expected: str):
    return 1 if output == expected else 0

# LLM-as-judge scorer
project.scorers.create(
    name="Helpfulness scorer",
    slug="helpfulness-scorer",
    description="Evaluate helpfulness of response",
    messages=[
        {
            "role": "user",
            "content": 'Rate the helpfulness of this response: {{output}}\n\nReturn "A" for very helpful, "B" for somewhat helpful, "C" for not helpful.',
        }
    ],
    model="gpt-4o",
    use_cot=True,
    choice_scores={
        "A": 1,
        "B": 0.5,
        "C": 0,
    },
    metadata={"__pass_threshold": 0.7},
)

Push to Braintrust:

npx braintrust push scorer.ts

braintrust push scorer.py

Python external dependencies

Python scorers created via the CLI have these default packages:

autoevals
braintrust
openai
pydantic
requests

For additional packages, use the --requirements flag.

For scorers with external dependencies:

scorer-with-deps.py

import braintrust
from langdetect import detect  # External package
from pydantic import BaseModel

project = braintrust.projects.create(name="my-project")

class LanguageMatchParams(BaseModel):
    output: str
    expected: str

project.scorers.create(
    name="Language match",
    slug="language-match",
    description="Check if output and expected are same language",
    parameters=LanguageMatchParams,
    handler=lambda output, expected: 1.0 if detect(output) == detect(expected) else 0.0,
    metadata={"__pass_threshold": 0.5},
)

Create requirements file:

langdetect==1.0.9

Push with requirements:

braintrust push scorer-with-deps.py --requirements requirements.txt

Important notes for Python scorers:

Scorers must be pushed from within their directory (e.g., braintrust push scorer.py); pushing with relative paths (e.g., braintrust push path/to/scorer.py) is unsupported and will cause import errors.
Scorers using local imports must be defined at the project root.
Braintrust uses uv to cross-bundle dependencies to Linux. This works for binary dependencies except libraries requiring on-demand compilation.

TypeScript bundling

In TypeScript, Braintrust uses esbuild to bundle your code and dependencies. This works for most dependencies but does not support native (compiled) libraries like SQLite.If you have trouble bundling dependencies, file an issue in the braintrust-sdk repo.

Scorer parameters

Scorers receive these parameters:

input: The input to your task
output: The output from your task
expected: The expected output (optional)
metadata: Custom metadata from the test case

Return a number between 0 and 1, or an object with score and optional metadata.

Set pass thresholds

Define minimum acceptable scores using __pass_threshold in metadata (value between 0 and 1):

metadata: {
  __pass_threshold: 0.7,  // Scores below 0.7 are considered failures
}

You can also set pass thresholds when creating or editing scorers in the UI using the threshold slider. When a scorer has a pass threshold configured:

Scores that meet or exceed the threshold are marked as passing and displayed with green highlighting and a checkmark
Scores below the threshold are marked as failing and displayed with red highlighting

This visual feedback makes it easy to scan evaluation results and identify which outputs meet your quality criteria at a glance.

Optimize with Loop

Generate and improve scorers using Loop: Example queries:

“Write an LLM-as-a-judge scorer for a chatbot that answers product questions”
“Generate a code-based scorer based on project logs”
“Optimize the Helpfulness scorer”
“Adjust the scorer to be more lenient”

Loop can also tune scorers based on manual labels from the playground.

Best practices

Start with autoevals: Use pre-built scorers when they fit your needs. They’re well-tested and reliable. Be specific: Define clear evaluation criteria in your scorer prompts or code. Use multiple scorers: Measure different aspects (factuality, helpfulness, tone) with separate scorers. Test scorers: Run scorers on known examples to verify they behave as expected. Version scorers: Like prompts, scorers are versioned automatically. Track what works. Balance cost and quality: LLM-as-a-judge scorers are more flexible but cost more and take longer than code-based scorers.

Next steps

Run evaluations using your scorers
Interpret results to understand scores
Write prompts to guide model behavior
Use playgrounds to test scorers interactively

Start

Instrument

Observe

Annotate

Evaluate

Deploy

Admin

Best practices

Write effective scorers

Use autoevals

Create custom scorers

Code-based scorers

LLM-as-a-judge scorers

TypeScript scorers

Python scorers

Python external dependencies

TypeScript bundling

Scorer parameters

Set pass thresholds

Optimize with Loop

Best practices

Next steps

Start

Instrument

Observe

Annotate

Evaluate

Deploy

Admin

Best practices

​Use autoevals

​Create custom scorers

​Code-based scorers

​LLM-as-a-judge scorers

​TypeScript scorers

​Python scorers

​Python external dependencies

​TypeScript bundling

​Scorer parameters

​Set pass thresholds

​Optimize with Loop

​Best practices

​Next steps

Use autoevals

Create custom scorers

Code-based scorers

LLM-as-a-judge scorers

TypeScript scorers

Python scorers

Python external dependencies

TypeScript bundling

Scorer parameters

Set pass thresholds

Optimize with Loop

Best practices

Next steps