Scorers evaluate AI outputs by assigning scores between 0 and 100%. Use pre-built scorers from autoevals, create custom code-based scorers, or build LLM-as-a-judge scorers to measure what matters for your application.

Use autoevals

The autoevals library provides pre-built scorers for common evaluation tasks:
TypeScript:
import { Factuality, Levenshtein, EmbeddingSimilarity } from "autoevals";
Python:
from autoevals import Factuality, Levenshtein, EmbeddingSimilarity
Popular autoevals scorers:
  • Factuality: Check whether the output is factually consistent with the expected answer
  • EmbeddingSimilarity: Measure semantic similarity to the expected output using embeddings
  • Levenshtein: Score string similarity to the expected output using edit distance
  • JSON: Validate JSON structure and content
  • SQL: Validate SQL query syntax and semantics
Use autoevals directly in evaluations:
import { Eval } from "braintrust";
import { Factuality } from "autoevals";

Eval("My Project", {
  data: myDataset,
  task: myTask,
  scores: [Factuality],
});

Create custom scorers

For specialized evaluation, create custom scorers in TypeScript or Python, or build LLM-as-a-judge scorers.
Navigate to Scorers > + Scorer to create scorers in the UI.

Code-based scorers

Write TypeScript or Python code that evaluates outputs; a minimal example follows the package list below.
UI scorers have access to these packages:
  • anthropic
  • autoevals
  • braintrust
  • json
  • math
  • openai
  • re
  • requests
  • typing
For additional packages, use the SDK method below.
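In the SDK, a code-based scorer is simply a function over the fields described under Scorer parameters below. A minimal TypeScript sketch (the function name and matching logic are illustrative, not a Braintrust API):

// Illustrative code-based scorer: exact string match against the expected output.
function exactMatch({
  output,
  expected,
}: {
  output: string;
  expected?: string;
}): number {
  if (expected === undefined) {
    return 0; // no expected value to compare against
  }
  return output.trim() === expected.trim() ? 1 : 0;
}

Pass it in the scores array of an Eval alongside autoevals scorers, for example scores: [exactMatch, Factuality].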

LLM-as-a-judge scorers

Define prompts that evaluate outputs and map choices to scores. Configure:
  • Prompt: Instructions for evaluating the output
  • Model: Which model to use as judge
  • Choice scores: Map model choices (A, B, C) to numeric scores
  • Use CoT: Enable chain-of-thought reasoning for complex evaluations
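In code, the same pattern is available through autoevals; a hedged sketch using LLMClassifierFromTemplate (the scorer name, prompt, and choices are illustrative):

import { LLMClassifierFromTemplate } from "autoevals";

// Illustrative LLM-as-a-judge scorer that grades helpfulness on an A/B/C scale.
const Helpfulness = LLMClassifierFromTemplate({
  name: "Helpfulness",
  promptTemplate:
    "Question: {{input}}\nAnswer: {{output}}\n" +
    "Grade the answer: (A) fully addresses the question, " +
    "(B) partially addresses it, (C) does not address it.",
  choiceScores: { A: 1, B: 0.5, C: 0 }, // map choices to numeric scores
  useCoT: true, // ask the judge to reason before choosing
});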

Scorer parameters

Scorers receive these parameters:
  • input: The input to your task
  • output: The output from your task
  • expected: The expected output (optional)
  • metadata: Custom metadata from the test case
Return a number between 0 and 1, or an object with score and optional metadata.
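For example, a scorer can return an explanatory payload alongside the score; a minimal sketch (the length check itself is illustrative):

// Illustrative scorer that returns an object with a score plus metadata.
function lengthCheck({ output }: { output: string }) {
  const limit = 500; // hypothetical character budget
  return {
    score: output.length <= limit ? 1 : 0,
    metadata: { outputLength: output.length, limit },
  };
}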

Set pass thresholds

Define minimum acceptable scores using __pass_threshold in metadata (value between 0 and 1):
metadata: {
  __pass_threshold: 0.7,  // Scores below 0.7 are considered failures
}
You can also set pass thresholds when creating or editing scorers in the UI using the threshold slider. When a scorer has a pass threshold configured:
  • Scores that meet or exceed the threshold are marked as passing and displayed with green highlighting and a checkmark
  • Scores below the threshold are marked as failing and displayed with red highlighting
This visual feedback makes it easy to scan evaluation results and identify which outputs meet your quality criteria at a glance.

Optimize with Loop

Generate and improve scorers using Loop. Example queries:
  • “Write an LLM-as-a-judge scorer for a chatbot that answers product questions”
  • “Generate a code-based scorer based on project logs”
  • “Optimize the Helpfulness scorer”
  • “Adjust the scorer to be more lenient”
Loop can also tune scorers based on manual labels from the playground.

Best practices

  • Start with autoevals: Use pre-built scorers when they fit your needs. They're well-tested and reliable.
  • Be specific: Define clear evaluation criteria in your scorer prompts or code.
  • Use multiple scorers: Measure different aspects (factuality, helpfulness, tone) with separate scorers.
  • Test scorers: Run scorers on known examples to verify they behave as expected.
  • Version scorers: Like prompts, scorers are versioned automatically. Track what works.
  • Balance cost and quality: LLM-as-a-judge scorers are more flexible but cost more and take longer than code-based scorers.

Next steps