The autoevals library provides pre-built scorers for common evaluation tasks. They are open source, deterministic where possible, and optimized for speed and reliability. Autoevals evaluate individual spans, not entire traces. Available scorers include:
  • Factuality: Check whether the output is factually consistent with the expected output
  • Semantic: Measure semantic similarity to expected output
  • Levenshtein: Calculate edit distance from expected output
  • JSON: Validate JSON structure and content
  • SQL: Validate SQL query syntax and semantics
See the TypeScript or Python reference for the complete list. You can use autoevals inline in SDK evaluation code, or select them in the UI when running experiments, testing in playgrounds, or setting up online scoring rules. There is no CLI push step: autoevals are library imports, not pushed scorers.
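
Each scorer is an ordinary function exported by the autoevals package: it takes the evaluation arguments and returns a score between 0 and 1. As a quick illustration (the strings here are made up), you can call one directly once the package is installed (see Install below):
import { Factuality } from "autoevals";

const result = await Factuality({
  input: "Which country has the highest population?",
  output: "People's Republic of China",
  expected: "China",
});
// result.score is a number between 0 and 1
console.log(`Factuality score: ${result.score}`);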

Install

Install the autoevals package for your language:
# pnpm
pnpm add autoevals
# npm
npm install autoevals
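If you're using Python, the same scorers are published in the autoevals package on PyPI:
# pip
pip install autoevals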

Score with the SDK

Use autoevals inline in your evaluation code:
import { Eval, initDataset } from "braintrust";
import { Factuality } from "autoevals";

Eval("My Project", {
  experimentName: "My experiment",
  data: initDataset("My Project", { dataset: "My Dataset" }),
  task: async (input) => {
    // Your LLM call here
    return await callModel(input);
  },
  scores: [Factuality],
  metadata: {
    model: "gpt-5-mini",
  },
});
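The scores array accepts multiple scorers, so a single experiment can report several metrics. A sketch combining an LLM-based scorer with a deterministic one (callModel is the same placeholder for your LLM call as above):
import { Eval, initDataset } from "braintrust";
import { Factuality, Levenshtein } from "autoevals";

Eval("My Project", {
  experimentName: "Multiple scorers",
  data: initDataset("My Project", { dataset: "My Dataset" }),
  task: async (input) => callModel(input),
  // Each scorer produces its own column of results in the experiment
  scores: [Factuality, Levenshtein],
});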
Autoevals automatically receive these parameters when used in evaluations:
  • input: The input to your task
  • output: The output from your task
  • expected: The expected output (optional)
  • metadata: Custom metadata from the test case
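Because scorers receive these arguments as a single object, you can wrap an autoeval in your own function to adjust them before scoring. A sketch, assuming your test cases store a reference answer in a hypothetical metadata.reference field (not something Braintrust defines):
import { Factuality } from "autoevals";

// Hypothetical wrapper: if a test case has no `expected` value, fall back to a
// reference answer stored in its metadata before delegating to Factuality.
const FactualityWithFallback = async ({
  input,
  output,
  expected,
  metadata,
}: {
  input: string;
  output: string;
  expected?: string;
  metadata?: Record<string, unknown>;
}) =>
  Factuality({
    input,
    output,
    expected: expected ?? (metadata?.reference as string | undefined),
  });

// Then use it like any other scorer:
//   scores: [FactualityWithFallback],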

Score in the UI

  • Use in playgrounds: When testing prompts in playgrounds, add autoevals in the scoring section to evaluate results interactively.
  • Use in experiments: When creating experiments, select autoevals from the scorer dropdown to measure output quality across your dataset.
  • Use in online scoring: Add autoevals to online scoring rules to automatically evaluate production logs.

Set pass thresholds

Define minimum acceptable scores to automatically mark results as passing or failing. When a threshold is configured, scores that meet or exceed it are marked as passing (green highlighting with a checkmark), while scores below it are marked as failing (red highlighting). In the UI, use the Pass threshold slider when selecting a scorer in an experiment, playground, or online scoring rule configuration.

Next steps