The autoevals library provides pre-built scorers for common evaluation tasks. They are open source, deterministic where possible, and optimized for speed and reliability. Autoevals evaluate individual spans, not entire traces.
Available scorers include:
- Factuality: Check whether the output is factually consistent with the expected output
- Semantic: Measure semantic similarity to expected output
- Levenshtein: Calculate edit distance from expected output
- JSON: Validate JSON structure and content
- SQL: Validate SQL query syntax and semantics
Install
Install the autoevals package for your language:
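For example, the Python package can be installed from PyPI (an npm package of the same name is available for TypeScript/JavaScript):

```bash
pip install autoevals
```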
Score with the SDK
Use autoevals inline in your evaluation code. Scorers receive the following arguments:
- input: The input to your task
- output: The output from your task
- expected: The expected output (optional)
- metadata: Custom metadata from the test case
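A minimal Python sketch using the Factuality scorer; the example strings are illustrative, and an OpenAI API key is assumed to be configured since Factuality is LLM-based:

```python
# Illustrative sketch: score a single output with the Factuality scorer.
from autoevals.llm import Factuality

evaluator = Factuality()
result = evaluator(
    output="People's Republic of China",                 # the output from your task
    expected="China",                                     # the expected output (optional)
    input="Which country has the highest population?",   # the input to your task
)

print(result.score)     # score between 0 and 1
print(result.metadata)  # scorer-specific details
```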
Score in the UI
- Use in playgrounds: When testing prompts in playgrounds, add autoevals in the scoring section to evaluate results interactively.
- Use in experiments: When creating experiments, select autoevals from the scorer dropdown to measure output quality across your dataset.
- Use in online scoring: Add autoevals to online scoring rules to automatically evaluate production logs.
Set pass thresholds
Define minimum acceptable scores to automatically mark results as passing or failing. When a threshold is configured, scores that meet or exceed it are marked as passing (green highlighting with a checkmark), while scores below it are marked as failing (red highlighting). In the UI, use the Pass threshold slider when selecting a scorer in an experiment, playground, or online scoring rule configuration.
Next steps
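As a sketch of the pass/fail semantics only (the helper below is hypothetical, not part of autoevals or the UI), a result passes when its score meets or exceeds the configured threshold:

```python
# Hypothetical helper illustrating pass-threshold semantics.
def passes(score: float, threshold: float = 0.6) -> bool:
    """Return True when the score meets or exceeds the pass threshold."""
    return score >= threshold

print(passes(0.75))  # True  -> shown as passing (green, checkmark)
print(passes(0.40))  # False -> shown as failing (red)
```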
- LLM-as-a-judge for subjective judgments like tone or helpfulness
- Custom code for business rules, pattern matching, or calculations
- Run evaluations using your scorers