Custom code scorers

Custom code scorers let you write evaluation logic in TypeScript, Python, or Ruby with full control over the scoring algorithm. They can use any packages you need and are best when you have specific rules, patterns, or calculations to implement. You can define custom code scorers in three places:

Inline in SDK code: Define scorers directly in your evaluation scripts for local development or application-specific logic.
Pushed via CLI: Define scorers in TypeScript or Python files and push them to Braintrust for team-wide sharing and automatic evaluation of production logs.
Created in UI: Build scorers in the Braintrust web interface using the built-in code editor.

Most teams prototype in the UI, then push production-ready scorers via the CLI. See Scorers overview for guidance.

Score spans

Span-level scorers evaluate individual operations or outputs. Use them for measuring single LLM responses, checking specific tool calls, or validating individual outputs. Each matching span receives an independent score. Your scorer function receives these parameters:

input: The input to your task
output: The output from your task
expected: The expected output (optional)
metadata: Custom metadata from the test case

Return a number between 0 and 1, or an object with score and optional metadata. In Ruby, declare only the parameters you need as keyword arguments. The runner automatically filters out the rest: |output:, expected:|.

Use scorers inline in your evaluation code:

equality_scorer.eval.ts

import { Eval, type Scorer } from "braintrust";
import OpenAI from "openai";

const client = new OpenAI();

const DATASET = [
  {
    input: "What is 2+2?",
    expected: "4",
  },
  {
    input: "What is the capital of France?",
    expected: "Paris",
  },
];

async function task(input: string): Promise<string> {
  const response = await client.responses.create({
    model: "gpt-5-mini",
    input: [
      { role: "user", content: input },
    ],
  });
  return response.output_text ?? "";
}

const equalityScorer: Scorer = ({ output, expected }) => {
  if (!expected) return null;
  const matches = output === expected;
  return {
    name: "Equality",
    score: matches ? 1 : 0,
    metadata: { exact_match: matches },
  };
};

const containsScorer: Scorer = ({ output, expected }) => {
  if (!expected) return null;
  const contains = output.toLowerCase().includes(expected.toLowerCase());
  return {
    name: "Contains expected",
    score: contains ? 1 : 0,
  };
};

Eval("Custom Code Scorer Example", {
  data: DATASET,
  task,
  scores: [equalityScorer, containsScorer],
});

Define TypeScript or Python scorers in code and push to Braintrust:

code_scorer.ts

import braintrust from "braintrust";
import { z } from "zod";

const project = braintrust.projects.create({ name: "my-project" });

project.scorers.create({
  name: "Equality scorer",
  slug: "equality-scorer",
  description: "Check if output equals expected",
  parameters: z.object({
    output: z.string(),
    expected: z.string(),
  }),
  handler: async ({ output, expected }) => {
    const matches = output === expected;
    return {
      score: matches ? 1 : 0,
      metadata: { exact_match: matches },
    };
  },
  metadata: {
    __pass_threshold: 0.5,
  },
});

Push to Braintrust:

bt functions push code_scorer.ts

Important notes for Python scorers:

Scorers must be pushed from within their directory (e.g., bt functions push scorer.py); pushing with relative paths (e.g., bt functions push path/to/scorer.py) is unsupported and will cause import errors.
Scorers using local imports must be defined at the project root.
The maximum supported Python version for scorers created with the Braintrust CLI is 3.13.
Braintrust uses uv to cross-bundle dependencies to Linux. This works for binary dependencies except libraries requiring on-demand compilation.

TypeScript bundling

In TypeScript, Braintrust uses esbuild to bundle your code and dependencies. This works for most dependencies but does not support native (compiled) libraries like SQLite.If you have trouble bundling dependencies, file an issue in the braintrust-sdk repo.

Python external dependencies

Python scorers created via the CLI have these default packages:

autoevals
braintrust
openai
pydantic
requests

For additional packages, use the --requirements flag.For scorers with external dependencies:

scorer-with-deps.py

import braintrust
from langdetect import detect
from pydantic import BaseModel

project = braintrust.projects.create(name="my-project")

class LanguageMatchParams(BaseModel):
    output: str
    expected: str

@project.scorers.create(
    name="Language match",
    slug="language-match",
    description="Check if output and expected are same language",
    parameters=LanguageMatchParams,
    metadata={"__pass_threshold": 0.5},
)
def language_match_scorer(output: str, expected: str):
    return 1.0 if detect(output) == detect(expected) else 0.0

Create requirements file:

langdetect==1.0.9

Push with requirements:

bt functions push scorer-with-deps.py --requirements requirements.txt

Go to Scorers > + Scorer.
Enter a scorer name and slug.
Select TypeScript or Python.
Write your scorer function. The code editor provides real-time linting and autocomplete.
Click Save as custom scorer.

function handler({
  input,
  output,
  expected,
  metadata,
}: {
  input: any;
  output: any;
  expected: any;
  metadata: Record<string, any>;
}): number | null {
  if (expected === null) return null;
  return output === expected ? 1 : 0;
}

UI scorers have access to these packages:

anthropic
autoevals
braintrust
json
math
openai
re
requests
typing

For additional packages, use the CLI.

Score traces

Trace-level scorers evaluate entire execution traces including all spans and conversation history. Use these for assessing multi-turn conversation quality, overall workflow completion, or when your scorer needs access to the full execution context. The scorer runs once per trace. Your handler function receives the trace parameter, which provides methods for accessing execution data:

Get spans: Returns spans matching the filter. Each span includes input, output, metadata, span_id, and span_attributes. Omit the filter to get all spans, or pass multiple types like ["llm", "tool"].
- TypeScript: trace.getSpans({ spanType: ["llm"] })
- Python: trace.get_spans(span_type=["llm"])
- Ruby: trace.spans(span_type: "llm")
Get thread: Returns an array of conversation messages extracted from LLM spans.
- TypeScript: trace.getThread()
- Python: trace.get_thread()
- Ruby: trace.thread

input, output, expected, and metadata are automatically populated from the root span and passed to your scorer function.

Trace-level scoring requires TypeScript SDK v2.2.1+, Python SDK v0.5.6+, or Ruby SDK v0.2.1+.

Use scorers inline in your evaluation code:

trace_code_scorer.eval.ts

import { Eval, wrapOpenAI, wrapTraced, type Scorer } from "braintrust";
import OpenAI from "openai";

const client = wrapOpenAI(new OpenAI());

const SUPPORT_DATASET = [
  { input: "My order hasn't arrived yet. Order #12345." },
  { input: "I need help resetting my password." },
];

const callLLM = wrapTraced(async function callLLM(messages: Array<{ role: string; content: string }>) {
  const response = await client.chat.completions.create({
    model: "gpt-5-mini",
    messages,
  });
  return response.choices[0].message.content || "";
});

async function supportTask(input: string): Promise<string> {
  const messages: Array<{ role: string; content: string }> = [
    { role: "system", content: "You are a helpful customer support agent." }
  ];

  messages.push({ role: "user", content: input });
  const response1 = await callLLM(messages);
  messages.push({ role: "assistant", content: response1 });

  messages.push({ role: "user", content: "Can you provide more details?" });
  const response2 = await callLLM(messages);
  messages.push({ role: "assistant", content: response2 });

  messages.push({ role: "user", content: "Thank you for your help!" });
  const response3 = await callLLM(messages);

  return response3;
}

const politenessScorer: Scorer = async ({ trace }) => {
  if (!trace) return 0;

  const thread = await trace.getThread();
  const lastAssistantMsg = thread.reverse().find(msg => msg.role === "assistant");
  const content = lastAssistantMsg?.content?.toLowerCase() || "";

  const politeWords = ["welcome", "glad", "happy", "pleasure", "thank"];
  const isPolite = politeWords.some(word => content.includes(word));

  return {
    name: "Politeness",
    score: isPolite ? 1 : 0,
    metadata: { checked_message_preview: content.slice(0, 80) },
  };
};

const efficiencyScorer: Scorer = async ({ trace }) => {
  if (!trace) return 0;

  const llmSpans = await trace.getSpans({ spanType: ["llm"] });
  const isEfficient = llmSpans.length >= 3 && llmSpans.length <= 5;

  return {
    name: "Efficiency",
    score: isEfficient ? 1 : 0,
    metadata: { llm_calls: llmSpans.length },
  };
};

Eval("Support Quality", {
  data: SUPPORT_DATASET,
  task: supportTask,
  scores: [politenessScorer, efficiencyScorer],
});

Define TypeScript or Python scorers in code and push to Braintrust:

trace_code_scorer.ts

import braintrust from "braintrust";
import { z } from "zod";

const project = braintrust.projects.create({ name: "my-project" });

project.scorers.create({
  name: "Politeness scorer",
  slug: "politeness-scorer",
  description: "Check if assistant responds politely",
  parameters: z.object({
    trace: z.any(),
  }),
  handler: async ({ trace }) => {
    if (!trace) return 0;

    const thread = await trace.getThread();
    const lastAssistantMsg = thread.reverse().find(msg => msg.role === "assistant");
    const content = lastAssistantMsg?.content?.toLowerCase() || "";

    const politeWords = ["welcome", "glad", "happy", "pleasure", "thank"];
    const isPolite = politeWords.some(word => content.includes(word));

    return {
      score: isPolite ? 1 : 0,
      metadata: { checked_message_preview: content.slice(0, 80) },
    };
  },
});

project.scorers.create({
  name: "Efficiency scorer",
  slug: "efficiency-scorer",
  description: "Check if conversation was efficient",
  parameters: z.object({
    trace: z.any(),
  }),
  handler: async ({ trace }) => {
    if (!trace) return 0;

    const llmSpans = await trace.getSpans({ spanType: ["llm"] });
    const isEfficient = llmSpans.length >= 3 && llmSpans.length <= 5;

    return {
      score: isEfficient ? 1 : 0,
      metadata: { llm_calls: llmSpans.length },
    };
  },
});

Push to Braintrust:

bt functions push trace_code_scorer.ts

Go to Scorers > + Scorer.
Enter a scorer name and slug.
Select TypeScript or Python.
Write your scorer function with the trace parameter. The code editor provides real-time linting and autocomplete.
Click Save as custom scorer.

import type { Trace } from 'braintrust';

async function handler({
  input,
  output,
  expected,
  metadata,
  trace,
}: {
  input: any;
  output: any;
  expected: any;
  metadata: Record<string, any>;
  trace: Trace;
}): Promise<
  | number
  | { score: number; name?: string; metadata?: Record<string, unknown> }
  | null
> {
  if (expected === null) return null;

  const allSpans = await trace.getSpans();
  const llmSpans = await trace.getSpans({ spanType: ["llm"] });

  return {
    name: "span count scorer",
    score: output === expected ? 1 : 0,
    metadata: {
      totalSpanCount: allSpans.length,
      llmSpanCount: llmSpans.length,
    },
  };
}

UI scorers have access to these packages:

anthropic
autoevals
braintrust
json
math
openai
re
requests
typing

For additional packages, use the CLI.

Set pass thresholds

Define minimum acceptable scores to automatically mark results as passing or failing. When configured, scores that meet or exceed the threshold are marked as passing (green highlighting with checkmark), while scores below are marked as failing (red highlighting).

Add __pass_threshold to the scorer’s metadata (value between 0 and 1):

project.scorers.create({
  name: "Quality checker",
  slug: "quality-checker",
  handler: async ({ output, expected }) => {
    return output === expected ? 1 : 0;
  },
  metadata: {
    __pass_threshold: 0.8,
  },
});

Return multiple scores

A single scorer can return an array of score objects to emit multiple named metrics from one call. This is useful when several quality dimensions can be computed together or share computation. Each item appears as its own score column in the Braintrust UI. Each item requires name and score. metadata is optional.

Eval("Summary Quality", {
  data: DATASET,
  task,
  scores: [
    ({ output, expected }) => {
      const words = (output ?? "").toLowerCase().split(/\s+/);
      const keyTerms: string[] = expected.key_terms;
      const covered = keyTerms.filter((t) => words.includes(t)).length;
      return [
        {
          name: "coverage",
          score: keyTerms.length ? covered / keyTerms.length : 1,
          metadata: { missing: keyTerms.filter((t) => !words.includes(t)) },
        },
        {
          name: "conciseness",
          score: words.length <= expected.max_words ? 1 : 0,
          metadata: { word_count: words.length, limit: expected.max_words },
        },
      ];
    },
  ],
});

Next steps

Autoevals for pre-built scorers without writing code
LLM-as-a-judge for natural language evaluation criteria
Run evaluations using your scorers
Score production logs with online scoring rules

Start

Instrument

Observe

Annotate

Evaluate

Deploy

Admin

Best practices

Score spans

Score traces

Set pass thresholds

Return multiple scores

Next steps

Start

Instrument

Observe

Annotate

Evaluate

Deploy

Admin

Best practices

​Score spans

​Score traces

​Set pass thresholds

​Return multiple scores

​Next steps

Score spans

Score traces

Set pass thresholds

Return multiple scores

Next steps