Use autoevals
The autoevals library provides pre-built scorers for common evaluation tasks:
- Factuality: Check if output contains factual information
- Semantic: Measure semantic similarity to expected output
- Levenshtein: Calculate edit distance from expected output
- JSON: Validate JSON structure and content
- SQL: Validate SQL query syntax and semantics
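For example, a pre-built scorer can be called directly on a single example before wiring it into an evaluation. A minimal sketch using the Python autoevals package (Factuality calls an LLM under the hood, so it assumes an OpenAI API key is available in the environment):

```python
from autoevals import Factuality, Levenshtein

# Deterministic scorer: edit distance between output and expected.
lev = Levenshtein()
result = lev(output="Paris", expected="Paris, France")
print(result.score)  # between 0 and 1, higher means closer

# LLM-based scorer: checks whether the output is factually consistent
# with the expected answer (requires an OpenAI API key in the environment).
fact = Factuality()
result = fact(
    output="The capital of France is Paris.",
    expected="Paris is the capital of France.",
    input="What is the capital of France?",
)
print(result.score, result.metadata)
```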
Create custom scorers
For specialized evaluation, create custom scorers in TypeScript or Python, or as LLM-as-a-judge prompts.
Security: For Braintrust-hosted deployments and self-hosted deployments on AWS, custom scorers run in isolated AWS Lambda environments within a dedicated VPC that has no access to internal infrastructure. See code execution security for details.
Navigate to Scorers > + Scorer to create scorers in the UI.

Configure one of the following scorer types:
Code-based scorers
Write TypeScript or Python code that evaluates outputs:
UI scorers have access to these packages:
anthropic, autoevals, braintrust, json, math, openai, re, requests, typing
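As a sketch, here is a Python code-based scorer that uses only the packages listed above to check that an output is valid JSON containing the keys from the expected output. The function name handler is an assumption for illustration; use whatever entry point the scorer editor expects:

```python
import json

def handler(input, output, expected, metadata):
    # Parse the output; anything that isn't valid JSON scores 0.
    try:
        parsed = json.loads(output)
    except (json.JSONDecodeError, TypeError):
        return 0
    # Score the fraction of expected top-level keys present in the output.
    expected_keys = set(json.loads(expected).keys()) if expected else set()
    if not expected_keys:
        return 1  # nothing specific to compare against
    present = expected_keys & set(parsed.keys())
    return len(present) / len(expected_keys)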
LLM-as-a-judge scorers
Define prompts that evaluate outputs and map choices to scores:
- Prompt: Instructions for evaluating the output
- Model: Which model to use as judge
- Choice scores: Map model choices (A, B, C) to numeric scores
- Use CoT: Enable chain-of-thought reasoning for complex evaluations
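The same pattern is available programmatically through the autoevals LLMClassifier helper, which maps the judge's choice to a numeric score. A minimal sketch, assuming an OpenAI API key is configured; the prompt wording, choices, and model name are illustrative:

```python
from autoevals import LLMClassifier

# Prompt fields like {{input}} and {{output}} are filled in from the
# arguments passed when the scorer is called.
helpfulness = LLMClassifier(
    name="Helpfulness",
    prompt_template=(
        "Question: {{input}}\n"
        "Answer: {{output}}\n\n"
        "Is the answer helpful and directly responsive to the question?\n"
        "A) Yes, fully helpful\n"
        "B) Partially helpful\n"
        "C) Not helpful"
    ),
    choice_scores={"A": 1.0, "B": 0.5, "C": 0.0},
    use_cot=True,    # chain-of-thought reasoning before choosing
    model="gpt-4o",  # judge model; pick whichever model your org uses
)

result = helpfulness(
    input="How do I reset my password?",
    output="Click 'Forgot password' on the login page and follow the email link.",
)
print(result.score)
```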
Scorer parameters
Scorers receive these parameters:
- input: The input to your task
- output: The output from your task
- expected: The expected output (optional)
- metadata: Custom metadata from the test case
Scorers return a score (a number between 0 and 1) and optional metadata.
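Putting these pieces together, here is a sketch of a scorer that consumes the parameters above and attaches extra detail to its result. The Score type shown is the one exported by autoevals, and returning a bare number between 0 and 1 also works; the "keywords" metadata key is a hypothetical field on the test case, used only for illustration:

```python
from autoevals import Score  # a plain numeric return is also accepted

def keyword_coverage(input, output, expected=None, metadata=None):
    # Hypothetical test-case metadata, e.g. {"keywords": ["refund", "30 days"]}.
    keywords = (metadata or {}).get("keywords", [])
    if not keywords:
        return Score(name="keyword_coverage", score=1.0)
    hits = [kw for kw in keywords if kw.lower() in (output or "").lower()]
    return Score(
        name="keyword_coverage",
        score=len(hits) / len(keywords),
        metadata={"matched": hits, "missing": [kw for kw in keywords if kw not in hits]},
    )
```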
Scorer permissions
Both LLM-as-a-judge scorers and code-based scorers automatically receive a BRAINTRUST_API_KEY environment variable that allows them to:
- Make LLM calls using organization and project AI secrets
- Access attachments from the current project
- Read and write logs to the current project
- Read prompts from the organization
Additional environment variables for scorers can be configured via the PUT /v1/env_var endpoint.
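Because the key is injected automatically, a code-based scorer can make LLM calls without any extra configuration, for example by routing requests through the Braintrust AI proxy. A sketch, assuming the standard proxy base URL (https://api.braintrust.dev/v1/proxy); the model name and rating prompt are illustrative:

```python
import os
from openai import OpenAI

# Uses the automatically injected BRAINTRUST_API_KEY; the proxy resolves
# the organization's and project's AI secrets for the requested model.
client = OpenAI(
    api_key=os.environ["BRAINTRUST_API_KEY"],
    base_url="https://api.braintrust.dev/v1/proxy",
)

def conciseness(input, output, expected=None, metadata=None):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; any model configured for your org
        messages=[
            {"role": "system", "content": "Rate the answer's conciseness from 0 to 10. Reply with only the number."},
            {"role": "user", "content": f"Question: {input}\nAnswer: {output}"},
        ],
    )
    # Assumes the judge follows the instruction and replies with a bare number.
    return int(response.choices[0].message.content.strip()) / 10
```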
Set pass thresholds
Define minimum acceptable scores using __pass_threshold in metadata (value between 0 and 1):
- Scores that meet or exceed the threshold are marked as passing and displayed with green highlighting and a checkmark
- Scores below the threshold are marked as failing and displayed with red highlighting
Test scorers
Scorers need to be developed iteratively against real data. When creating or editing a scorer in the UI, use the Run section to test your scorer with data from different sources. Each variable source populates the scorer’s input parameters (like input, output, expected, and metadata) from a different location.
Test with manual input
Best for initial development when you have a specific example in mind. Use this to quickly prototype and verify basic scorer logic before testing on larger datasets.
- Select Editor in the Run section.
- Enter values for the input, output, expected, and metadata fields.
- Click Test to see how your scorer evaluates the example.
- Iterate on your scorer logic based on the results.
Test with a dataset
Best for testing specific scenarios, edge cases, or regression testing. Use this when you want controlled, repeatable test cases or need to ensure your scorer handles specific situations correctly.
- Select Dataset in the Run section.
- Choose a dataset from your project.
- Select a record to test with.
- Click Test to see how your scorer evaluates the example.
- Review results to identify patterns and edge cases.
Test with logs
Best for testing against actual usage patterns and debugging real-world edge cases. Use this when you want to see how your scorer performs on data your system is actually generating.
- Select Logs in the Run section.
- Select the project containing the logs you want to test against.
- Filter logs to find relevant examples:
  - Click Add filter and choose just root spans, specific span names, or a more advanced filter based on specific input, output, metadata, or other values.
  - Select a timeframe.
- Click Test to see how your scorer evaluates real production data.
- Identify cases where the scorer needs adjustment for real-world scenarios.
Optimize with Loop
Generate and improve scorers using Loop. Example queries:
- “Write an LLM-as-a-judge scorer for a chatbot that answers product questions”
- “Generate a code-based scorer based on project logs”
- “Optimize the Helpfulness scorer”
- “Adjust the scorer to be more lenient”
Best practices
- Start with autoevals: Use pre-built scorers when they fit your needs. They’re well-tested and reliable.
- Be specific: Define clear evaluation criteria in your scorer prompts or code.
- Use multiple scorers: Measure different aspects (factuality, helpfulness, tone) with separate scorers.
- Test scorers: Run scorers on known examples to verify they behave as expected.
- Version scorers: Like prompts, scorers are versioned automatically. Track what works.
- Balance cost and quality: LLM-as-a-judge scorers are more flexible but cost more and take longer than code-based scorers.
Next steps
- Run evaluations using your scorers
- Interpret results to understand scores
- Write prompts to guide model behavior
- Use playgrounds to test scorers interactively