Scorers evaluate AI outputs by assigning scores between 0 and 1 (displayed as 0–100% in the UI). Use pre-built scorers from autoevals, create custom code-based scorers, or build LLM-as-a-judge scorers to measure what matters for your application.

Use autoevals

The autoevals library provides pre-built scorers for common evaluation tasks:
```typescript
import { Factuality, Levenshtein, Semantic } from "autoevals";
```

```python
from autoevals import Factuality, Levenshtein, Semantic
```
Popular autoevals scorers:
  • Factuality: Check whether the output is factually consistent with the expected answer
  • Semantic: Measure semantic similarity to expected output
  • Levenshtein: Calculate edit distance from expected output
  • JSON: Validate JSON structure and content
  • SQL: Validate SQL query syntax and semantics
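To build intuition for what these scorers compute, here is a minimal pure-Python sketch of a normalized edit-distance score in the spirit of the Levenshtein scorer. This is an illustration of the idea, not autoevals' implementation:

```python
def levenshtein_score(output: str, expected: str) -> float:
    """Return 1 - (edit distance / max length), so identical strings score 1.0."""
    if not output and not expected:
        return 1.0
    m, n = len(output), len(expected)
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if output[i - 1] == expected[j - 1] else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        prev = curr
    return 1 - prev[n] / max(m, n)
```

For example, `levenshtein_score("kitten", "sitting")` has edit distance 3 over a maximum length of 7, scoring roughly 0.57.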
Use autoevals directly in evaluations:
```typescript
import { Eval } from "braintrust";
import { Factuality } from "autoevals";

Eval("My Project", {
  data: myDataset,
  task: myTask,
  scores: [Factuality],
});
```

Create custom scorers

For specialized evaluation, create custom scorers in TypeScript, Python, or as LLM-as-a-judge.
Security: On Braintrust-hosted deployments and self-hosted deployments on AWS, scorers run in isolated AWS Lambda environments within a dedicated VPC that has no access to internal infrastructure. See code execution security for details.
Navigate to Scorers > + Scorer to create scorers in the UI.

Code-based scorers

Write TypeScript or Python code that evaluates outputs.
UI scorers have access to these packages:
  • anthropic
  • autoevals
  • braintrust
  • json
  • math
  • openai
  • re
  • requests
  • typing
For additional packages, use the SDK method below.
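As a sketch of what a code-based scorer's logic can look like, here is a hypothetical scorer using only the json package from the list above. The function name, signature, and criteria are illustrative; the exact entry-point signature expected by the UI may differ:

```python
import json

def handler(output, expected=None, **kwargs):
    """Hypothetical code-based scorer: checks that the output parses as JSON
    and gives partial credit for each expected top-level key it contains."""
    try:
        parsed = json.loads(output)
    except (TypeError, ValueError):
        return 0  # not valid JSON at all
    if not expected:
        return 1  # nothing specific to compare against
    required = set(json.loads(expected) if isinstance(expected, str) else expected)
    present = required & set(parsed)
    # Partial credit: fraction of required keys present in the output.
    return len(present) / len(required)
```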

LLM-as-a-judge scorers

Define prompts that evaluate outputs and map model choices to scores. Configure:
  • Prompt: Instructions for evaluating the output
  • Model: Which model to use as judge
  • Choice scores: Map model choices (A, B, C) to numeric scores
  • Use CoT: Enable chain-of-thought reasoning for complex evaluations
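Conceptually, the judge model picks a choice and the choice-score map converts it to a number. A minimal sketch of that mapping step, with hypothetical choices and values:

```python
# Hypothetical choice-score map: the judge answers A, B, or C.
CHOICE_SCORES = {"A": 1.0, "B": 0.5, "C": 0.0}

def score_from_choice(choice: str) -> float:
    """Map the judge's letter choice to a numeric score in [0, 1]."""
    normalized = choice.strip().upper()
    if normalized not in CHOICE_SCORES:
        raise ValueError(f"Unexpected judge choice: {choice!r}")
    return CHOICE_SCORES[normalized]
```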

Scorer parameters

Scorers receive these parameters:
  • input: The input to your task
  • output: The output from your task
  • expected: The expected output (optional)
  • metadata: Custom metadata from the test case
Return a number between 0 and 1, or an object with score and optional metadata.
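Putting the parameters and return shape together, a custom scorer might look like the sketch below. The length criterion and the max_words metadata field are placeholders, not a real Braintrust convention:

```python
def conciseness_scorer(input, output, expected=None, metadata=None):
    """Hypothetical scorer: rewards outputs at or under a target word count.

    Returns an object-style result with a score in [0, 1] plus metadata."""
    target = (metadata or {}).get("max_words", 50)  # hypothetical metadata field
    words = len(output.split())
    score = 1.0 if words <= target else max(0.0, 1 - (words - target) / target)
    return {
        "name": "Conciseness",
        "score": score,
        "metadata": {"word_count": words, "target": target},
    }
```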

Scorer permissions

Both LLM-as-a-judge scorers and code-based scorers automatically receive a BRAINTRUST_API_KEY environment variable that allows them to:
  • Make LLM calls using organization and project AI secrets
  • Access attachments from the current project
  • Read and write logs to the current project
  • Read prompts from the organization
For code-based scorers that need expanded permissions beyond the current project (such as logging to other projects, reading datasets, or accessing other organization data), you can provide your own API key using the PUT /v1/env_var endpoint.

Set pass thresholds

Define minimum acceptable scores using __pass_threshold in metadata (value between 0 and 1):
```typescript
metadata: {
  __pass_threshold: 0.7, // Scores below 0.7 are considered failures
}
```
You can also set pass thresholds when creating or editing scorers in the UI using the threshold slider. When a scorer has a pass threshold configured:
  • Scores that meet or exceed the threshold are marked as passing and displayed with green highlighting and a checkmark
  • Scores below the threshold are marked as failing and displayed with red highlighting
This visual feedback makes it easy to scan evaluation results and identify which outputs meet your quality criteria at a glance.
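The pass/fail determination itself is simple threshold logic; a sketch of how scores map to the passing state described above:

```python
PASS_THRESHOLD = 0.7  # mirrors the __pass_threshold metadata value

def annotate(scores):
    """Mark each score as passing when it meets or exceeds the threshold."""
    return [{"score": s, "passed": s >= PASS_THRESHOLD} for s in scores]
```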

Test scorers

Scorers need to be developed iteratively against real data. When creating or editing a scorer in the UI, use the Run section to test your scorer with data from different sources. Each variable source populates the scorer’s input parameters (like input, output, expected, metadata) from a different location.

Test with manual input

Best for initial development when you have a specific example in mind. Use this to quickly prototype and verify basic scorer logic before testing on larger datasets.
  1. Select Editor in the Run section.
  2. Enter values for input, output, expected, and metadata fields.
  3. Click Test to see how your scorer evaluates the example.
  4. Iterate on your scorer logic based on the results.

Test with a dataset

Best for testing specific scenarios, edge cases, or regression testing. Use this when you want controlled, repeatable test cases or need to ensure your scorer handles specific situations correctly.
  1. Select Dataset in the Run section.
  2. Choose a dataset from your project.
  3. Select a record to test with.
  4. Click Test to see how your scorer evaluates the example.
  5. Review results to identify patterns and edge cases.

Test with logs

Best for testing against actual usage patterns and debugging real-world edge cases. Use this when you want to see how your scorer performs on data your system is actually generating.
  1. Select Logs in the Run section.
  2. Select the project containing the logs you want to test against.
  3. Filter logs to find relevant examples:
    • Click Add filter and choose just root spans, specific span names, or a more advanced filter based on specific input, output, metadata, or other values.
    • Select a timeframe.
  4. Click Test to see how your scorer evaluates real production data.
  5. Identify cases where the scorer needs adjustment for real-world scenarios.
To create a new online scoring rule with the filters automatically prepopulated from your current log filters, click Online scoring. This enables rapid iteration from logs to scoring rules. See Create scoring rules for more details.

Optimize with Loop

Generate and improve scorers using Loop. Example queries:
  • “Write an LLM-as-a-judge scorer for a chatbot that answers product questions”
  • “Generate a code-based scorer based on project logs”
  • “Optimize the Helpfulness scorer”
  • “Adjust the scorer to be more lenient”
Loop can also tune scorers based on manual labels from the playground.

Best practices

  • Start with autoevals: Use pre-built scorers when they fit your needs. They're well-tested and reliable.
  • Be specific: Define clear evaluation criteria in your scorer prompts or code.
  • Use multiple scorers: Measure different aspects (factuality, helpfulness, tone) with separate scorers.
  • Test scorers: Run scorers on known examples to verify they behave as expected.
  • Version scorers: Like prompts, scorers are versioned automatically. Track what works.
  • Balance cost and quality: LLM-as-a-judge scorers are more flexible but cost more and take longer than code-based scorers.

Next steps