Use autoevals
The autoevals library provides pre-built scorers for common evaluation tasks:
- Factuality: Check whether the output is factually consistent with the expected output
- Semantic: Measure semantic similarity to expected output
- Levenshtein: Calculate edit distance from expected output
- JSON: Validate JSON structure and content
- SQL: Validate SQL query syntax and semantics
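For example, a pre-built scorer can be called directly on a single example. This is a minimal TypeScript sketch: the sample question and answer are illustrative, and the LLM-based Factuality scorer assumes an OpenAI API key is configured in the environment.

```typescript
import { Factuality, Levenshtein } from "autoevals";

async function main() {
  // Factuality uses an LLM to judge whether the output is factually
  // consistent with the expected answer.
  const factual = await Factuality({
    input: "Which country has the highest population?",
    output: "People's Republic of China",
    expected: "China",
  });
  console.log(factual.score); // score between 0 and 1

  // Levenshtein is a pure string-distance scorer; no LLM call is made.
  const distance = await Levenshtein({
    output: "People's Republic of China",
    expected: "China",
  });
  console.log(distance.score);
}

main();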
Create custom scorers
For specialized evaluation, create custom scorers in TypeScript or Python, or as an LLM-as-a-judge. You can create them in the UI or with the SDK.
Navigate to Scorers > + Scorer to create scorers in the UI.

Configure the scorer as code-based or as an LLM-as-a-judge:
Code-based scorers
Write TypeScript or Python code that evaluates outputs:
UI scorers have access to these packages: anthropic, autoevals, braintrust, json, math, openai, re, requests, typing.
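As a rough sketch, a code-based scorer is just a function that inspects the output and returns a score. The scorer name and the required "answer" field below are hypothetical.

```typescript
// A sketch of a code-based scorer: checks that the output is valid JSON
// and contains a hypothetical "answer" field.
function validJsonWithAnswer({ output }: { output: string }) {
  try {
    const parsed = JSON.parse(output);
    const hasAnswer =
      typeof parsed === "object" && parsed !== null && "answer" in parsed;
    return {
      name: "valid_json_with_answer",
      // Full credit for valid JSON with an "answer" key, half credit otherwise.
      score: hasAnswer ? 1 : 0.5,
    };
  } catch {
    // Output was not parseable JSON at all.
    return { name: "valid_json_with_answer", score: 0 };
  }
}
```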
LLM-as-a-judge scorers
Define prompts that evaluate outputs and map choices to scores:
- Prompt: Instructions for evaluating the output
- Model: Which model to use as judge
- Choice scores: Map model choices (A, B, C) to numeric scores
- Use CoT: Enable chain-of-thought reasoning for complex evaluations
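The same choice-to-score mapping can be sketched in code with the autoevals LLMClassifierFromTemplate helper. The prompt text, choices, and scores below are illustrative, not a prescribed rubric.

```typescript
import { LLMClassifierFromTemplate } from "autoevals";

// A sketch of an LLM-as-a-judge scorer built from a prompt template.
const helpfulness = LLMClassifierFromTemplate({
  name: "Helpfulness",
  promptTemplate: `You are judging a support chatbot.
Question: {{input}}
Answer: {{output}}

Choose the option that best describes the answer:
A. Fully answers the question
B. Partially answers the question
C. Does not answer the question`,
  choiceScores: { A: 1, B: 0.5, C: 0 }, // map choices to numeric scores
  useCoT: true, // ask the judge to reason before choosing
});

async function main() {
  const result = await helpfulness({
    input: "How do I reset my password?",
    output: "Click 'Forgot password' on the login page and follow the email link.",
  });
  console.log(result.score); // 1, 0.5, or 0 depending on the judge's choice
}

main();
```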
Scorer parameters
Scorers receive these parameters:
- input: The input to your task
- output: The output from your task
- expected: The expected output (optional)
- metadata: Custom metadata from the test case
Scorers return a score between 0 and 1 and optional metadata.
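Here is a hedged sketch of a scorer that uses all four parameters and returns a score plus optional metadata; the caseSensitive metadata flag is hypothetical.

```typescript
// Parameter shape follows the list above; the matching logic is illustrative.
interface ScorerArgs {
  input: string;
  output: string;
  expected?: string;
  metadata?: Record<string, unknown>;
}

function exactMatch({ output, expected, metadata }: ScorerArgs) {
  // Hypothetical per-test-case flag read from metadata.
  const caseSensitive = metadata?.caseSensitive === true;
  const normalize = (s: string) => (caseSensitive ? s : s.toLowerCase()).trim();
  const score =
    expected !== undefined && normalize(output) === normalize(expected) ? 1 : 0;
  return {
    name: "exact_match",
    score,
    metadata: { caseSensitive }, // optional metadata returned with the score
  };
}
```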
Set pass thresholds
Define minimum acceptable scores using __pass_threshold in metadata (a value between 0 and 1):
- Scores that meet or exceed the threshold are marked as passing and displayed with green highlighting and a checkmark
- Scores below the threshold are marked as failing and displayed with red highlighting
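A minimal sketch of attaching a pass threshold, assuming the threshold is carried in the scorer's returned metadata; where metadata lives in your setup may differ, and the scoring logic here is illustrative only.

```typescript
// Assumption: __pass_threshold is set in the metadata returned by the scorer.
function relevance({ output, expected }: { output: string; expected?: string }) {
  // Illustrative scoring logic: full credit if the expected text appears in the output.
  const score = expected && output.includes(expected) ? 1 : 0;
  return {
    name: "relevance",
    score,
    metadata: {
      // Scores of 0.8 or higher are marked as passing.
      __pass_threshold: 0.8,
    },
  };
}
```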
Optimize with Loop
Generate and improve scorers using Loop. Example queries:
- “Write an LLM-as-a-judge scorer for a chatbot that answers product questions”
- “Generate a code-based scorer based on project logs”
- “Optimize the Helpfulness scorer”
- “Adjust the scorer to be more lenient”
Best practices
- Start with autoevals: Use pre-built scorers when they fit your needs. They’re well-tested and reliable.
- Be specific: Define clear evaluation criteria in your scorer prompts or code.
- Use multiple scorers: Measure different aspects (factuality, helpfulness, tone) with separate scorers.
- Test scorers: Run scorers on known examples to verify they behave as expected.
- Version scorers: Like prompts, scorers are versioned automatically. Track what works.
- Balance cost and quality: LLM-as-a-judge scorers are more flexible but cost more and take longer than code-based scorers.
Next steps
- Run evaluations using your scorers
- Interpret results to understand scores
- Write prompts to guide model behavior
- Use playgrounds to test scorers interactively