Scorers

Scorers in Braintrust allow you to evaluate the output of LLMs based on a set of criteria. These can be heuristics (expressed as code) or LLM-as-a-judge evaluations (expressed as prompts). Scorers assign a performance score between 0 and 100% that reflects how well the AI's outputs match expected results. Many scorers are available out of the box in Braintrust, and you can also create your own custom scorers directly in the UI or upload them via the command line. Scorers can also be used as functions.

Autoevals

There are several pre-built scorers available via the open-source autoevals library, which offers standard evaluation methods that you can start using immediately.

Autoeval scorers offer a strong starting point for a variety of evaluation tasks. Some autoeval scorers require configuration before they can be used effectively. For example, you might need to define expected outputs or certain parameters for specific tasks. To edit an autoeval scorer, you must copy it first.
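For example, you can call an autoeval scorer directly from the open-source autoevals package. The sketch below uses the TypeScript package's Factuality scorer, which is LLM-based and assumes an OpenAI API key is configured in your environment:

import { Factuality } from "autoevals";
 
(async () => {
  // Factuality compares the output to the expected answer using an LLM judge.
  const result = await Factuality({
    input: "Which country has the highest population?",
    output: "People's Republic of China",
    expected: "China",
  });
  // Autoeval scores are normalized between 0 and 1.
  console.log(`Factuality score: ${result.score}`);
})();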

While autoevals are a great way to get started, you may eventually need to create your own custom scorers for more advanced use cases.

Custom scorers

For more specialized evals, you can create custom scorers in TypeScript, Python, or as an LLM-as-a-judge. Code-based scorers (TypeScript/Python) are highly customizable and can return scores based on your exact requirements, while LLM-as-a-judge scorers use prompts to evaluate outputs.

You can create custom scorers either in the Braintrust UI or via the command line using braintrust push. These scorers will be available to use as functions throughout your project.

Create custom scorers via UI

Navigate to Scorers > + Scorer to create custom scorers in the UI.

TypeScript and Python scorers

Add your custom code to the TypeScript or Python tabs. Your scorer will run in a sandboxed environment.

Scorers created via the UI run with these available packages:

  • anthropic
  • asyncio
  • autoevals
  • braintrust
  • json
  • math
  • openai
  • re
  • requests
  • typing

If you need to use packages outside this list, see Create custom scorers via CLI.
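For example, a code-based scorer in the TypeScript tab might look roughly like the sketch below. The exact signature the editor expects may vary; this assumes the handler receives the evaluation fields as a single object and returns a score between 0 and 1:

function handler({ output, expected }: { output: string; expected: string }): number {
  // Exact match scores 1, a case-insensitive match gets partial credit, anything else scores 0.
  if (output === expected) return 1;
  if (output.toLowerCase() === expected.toLowerCase()) return 0.5;
  return 0;
}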

Create TypeScript scorer

LLM-as-a-judge scorers

In addition to code-based scorers, you can also create LLM-as-a-judge scorers through the UI. Define a prompt that evaluates the AI's output and maps its choices to specific scores. You can also configure whether to use techniques like chain-of-thought (CoT) reasoning for more complex evaluations.
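For example, an equality judge might use a prompt like the following, with each lettered choice mapped to a score in the UI (an illustrative sketch; the {{output}} and {{expected}} placeholders are filled in from each test case):

Compare the submitted answer {{output}} to the expected answer {{expected}}.
(A) The submitted answer matches the expected answer.
(B) The submitted answer does not match the expected answer.

Mapping A to 1 and B to 0 turns the judge's choice into a score.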

Create LLM-as-a-judge scorer


Create custom scorers via CLI

As with tools, custom scorers written in the UI are limited to the sandbox's available packages and functionality. You can always write scorers in your own environment and upload them to Braintrust with braintrust push. This works for both code-based scorers and LLM-as-a-judge scorers.

TypeScript scorers

Both code-based and LLM-as-judge scorers can be written in TypeScript.

Write your scorer:

scorer.ts
import braintrust from "braintrust";
import { z } from "zod";
 
const project = braintrust.projects.create({ name: "scorer" });
 
// Code-based scorer
project.scorers.create({
  name: "Equality scorer",
  slug: "equality-scorer",
  description: "An equality scorer",
  parameters: z.object({
    output: z.string(),
    expected: z.string(),
  }),
  handler: async ({ output, expected }) => {
    return output == expected ? 1 : 0;
  },
});
 
// LLM-as-judge scorer
project.scorers.create({
  name: "Equality LLM scorer",
  slug: "equality-llm-scorer",
  description: "An equality LLM scorer",
  messages: [
    {
      role: "user",
      content:
        'Return "A" if {{output}} is equal to {{expected}}, and "B" otherwise.',
    },
  ],
  model: "gpt-4o",
  useCot: true,
  choiceScores: {
    A: 1,
    B: 0,
  },
});

Then push it to Braintrust:

npx braintrust push scorer.ts

In TypeScript, we use esbuild to bundle your code and its dependencies together. This works for most dependencies, but it does not support native (compiled) libraries like SQLite.

If you have trouble bundling your dependencies, file an issue in the braintrust-sdk repo.

Python scorers

Both code-based and LLM-as-judge scorers can be written in Python.

Python scorers created via the CLI run with these default available packages:

  • autoevals
  • braintrust
  • openai
  • pydantic
  • requests

To use packages beyond these, pass a requirements file via the --requirements flag when running braintrust push (see the example with external dependencies below).

Write your scorer:

scorer.py
import braintrust
import pydantic
 
project = braintrust.projects.create(name="scorer")
 
 
# Code-based scorer
class Input(pydantic.BaseModel):
    output: str
    expected: str
 
 
def handler(output: str, expected: str) -> int:
    return 1 if output == expected else 0
 
 
project.scorers.create(
    name="Equality scorer",
    slug="equality-scorer",
    description="An equality scorer",
    parameters=Input,
    handler=handler,
)
 
# LLM-as-judge scorer
project.scorers.create(
    name="Equality LLM scorer",
    slug="equality-llm-scorer",
    description="An equality LLM scorer",
    messages=[
        {
            "role": "user",
            "content": 'Return "A" if {{output}} is equal to {{expected}}, and "B" otherwise.',
        },
    ],
    model="gpt-4o",
    use_cot=True,
    choice_scores={"A": 1, "B": 0},
)

Then push it to Braintrust:

braintrust push scorer.py
  • Scorers must be pushed from within their directory (e.g. braintrust push scorer.py); pushing a scorer with relative paths (e.g. braintrust push path/to/scorer.py) is unsupported and will result in import errors at runtime.
  • Scorers using local imports must be defined at the project root.

To use packages beyond the default available ones, upload scorers with external dependencies by using the --requirements flag with braintrust push, for example:

scorer-with-external-dep.py
import braintrust
from langdetect import detect  # not in default available packages
from pydantic import BaseModel
 
project = braintrust.projects.create(name="scorer")
 
 
class LanguageMatchParams(BaseModel):
    output: str
    expected: str
 
 
project.scorers.create(
    name="Language match",
    slug="language-match",
    description="A same language scorer",
    parameters=LanguageMatchParams,
    handler=lambda output, expected: 1.0 if detect(output) == detect(expected) else 0.0,
)

For scorers with external dependencies, create a requirements file:

requirements.txt
langdetect==1.0.9

Then push it to Braintrust, using the --requirements flag:

braintrust push scorer-with-external-dep.py --requirements requirements.txt

In Python, we use uv to cross-bundle a specified list of dependencies to the target platform (Linux). This works for binary dependencies except for libraries that require on-demand compilation.

If you have trouble bundling your dependencies, file an issue in the braintrust-sdk repo.

Using a scorer in the UI

You can use both autoevals and custom scorers in a Braintrust playground. In your playground, navigate to Scorers and select from the list of available scorers. You can also create a new custom scorer from this menu.

Using scorer in playground

The playground allows you to iterate quickly on prompts while running evaluations, making it the perfect tool for testing and refining your AI models and prompts.

Using scorers in your evals

The scorers that you create in Braintrust are available throughout the UI, e.g. in the playground, but you can also use them in your code-based evals. See Using custom prompts/functions from Braintrust for more details.
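For example, a minimal code-based eval might mix an autoeval scorer with an inline code scorer (a sketch; the project name, data, and task below are placeholders, and referencing a scorer you pushed to Braintrust follows the guide linked above):

import { Eval } from "braintrust";
import { Levenshtein } from "autoevals";
 
Eval("scorer", {
  // Placeholder dataset with a single trivial test case.
  data: () => [{ input: "hi", expected: "hi" }],
  // Placeholder task that simply echoes its input.
  task: async (input: string) => input,
  scores: [
    Levenshtein, // an autoeval scorer
    // An inline code-based scorer, equivalent to the equality scorer above.
    ({ output, expected }) => (output === expected ? 1 : 0),
  ],
});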
