The autoevals library provides pre-built scorers for common evaluation tasks. They are open source, deterministic where possible, and optimized for speed and reliability. Autoevals evaluate individual spans, not entire traces.
Available scorers include:
- Factuality: Check whether the output is factually consistent with the expected output
- Semantic: Measure semantic similarity to expected output
- Levenshtein: Calculate edit distance from expected output
- JSON: Validate JSON structure and content
- SQL: Validate SQL query syntax and semantics
Install
Install the autoevals package for your language:
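For example, the Python package can be installed from PyPI (an npm package of the same name is available for TypeScript/JavaScript):

```bash
pip install autoevals
```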
Score with the SDK
Use autoevals inline in your evaluation code. Scorers receive the following arguments:
- input: The input to your task
- output: The output from your task
- expected: The expected output (optional)
- metadata: Custom metadata from the test case
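A minimal Python sketch using the Factuality scorer; the example strings are illustrative, and an OpenAI API key is assumed to be configured since Factuality is LLM-based:

```python
# Illustrative sketch: score a single output with the Factuality scorer.
from autoevals.llm import Factuality

evaluator = Factuality()
result = evaluator(
    output="People's Republic of China",                 # the output from your task
    expected="China",                                     # the expected output (optional)
    input="Which country has the highest population?",   # the input to your task
)

print(result.score)     # score between 0 and 1
print(result.metadata)  # scorer-specific details
```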
Score in the UI
- Use in playgrounds: When testing prompts in playgrounds, add autoevals in the scoring section to evaluate results interactively.
- Use in experiments: When creating experiments, select autoevals from the scorer dropdown to measure output quality across your dataset.
- Use in online scoring: Add autoevals to online scoring rules to automatically evaluate production logs.
Set pass thresholds
Define minimum acceptable scores to automatically mark results as passing or failing. When a threshold is configured, scores that meet or exceed it are marked as passing (green highlighting with a checkmark), while scores below it are marked as failing (red highlighting). In the UI, use the Pass threshold slider when selecting a scorer in an experiment, playground, or online scoring rule configuration.
Next steps
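As a sketch of the pass/fail semantics only (the helper below is hypothetical, not part of autoevals or the UI), a result passes when its score meets or exceeds the configured threshold:

```python
# Hypothetical helper illustrating pass-threshold semantics.
def passes(score: float, threshold: float = 0.6) -> bool:
    """Return True when the score meets or exceeds the pass threshold."""
    return score >= threshold

print(passes(0.75))  # True  -> shown as passing (green, checkmark)
print(passes(0.40))  # False -> shown as failing (red)
```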
- LLM-as-a-judge for subjective judgments like tone or helpfulness
- Custom code for business rules, pattern matching, or calculations
- Run evaluations using your scorers