AutoEvals is a tool to quickly and easily evaluate AI model outputs.

Installation

pip install autoevals

LLM Evaluators

Battle

Check whether a solution performs better than a reference solution.

ClosedQA

Evaluate answer correctness using the model’s knowledge.

Factuality

Check factual accuracy against a reference.
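
A minimal usage sketch for Factuality, adapted from the project README (assumes OPENAI_API_KEY is set in the environment):

```python
from autoevals.llm import Factuality

# Create an LLM-based evaluator (uses OpenAI by default)
evaluator = Factuality()

# Evaluate an example completion against a reference answer
input = "Which country has the highest population?"
output = "People's Republic of China"
expected = "China"

result = evaluator(output, expected, input=input)

# The returned Score has a value in [0, 1] plus metadata from the grader
print(f"Factuality score: {result.score}")
print(f"Factuality rationale: {result.metadata['rationale']}")
```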

Humor

Rate the humor level in text.

LLMClassifier

High-level classifier for evaluating text using LLMs.
Parameters:
  name: Any (required)
  prompt_template: Any (required)
  choice_scores: Any (required)
  model: Any
  use_cot: Any
  max_tokens: Any
  temperature: Any
  engine: Any
  api_key: Any
  base_url: Any
  client: Optional[Client]
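
A sketch of a custom classifier, adapted from the project README; the prompt text, choice mapping, template variables, and example data are illustrative:

```python
from autoevals import LLMClassifier

# Mustache-style template; {{input}}, {{output}}, and {{expected}} are filled in per call
prompt_template = """
You are comparing two candidate titles for a GitHub issue.

Issue description: {{input}}

1: {{output}}
2: {{expected}}

Which title better describes the issue? Answer 1 or 2.
"""

# Map each model choice to a score: 1 if the generated title wins, 0 otherwise
choice_scores = {"1": 1, "2": 0}

evaluator = LLMClassifier(
    name="TitleQuality",
    prompt_template=prompt_template,
    choice_scores=choice_scores,
    use_cot=True,
)

result = evaluator(
    output="Fix crash when uploading files larger than 2 GB",
    expected="Bug in file upload",
    input="The app crashes whenever a user uploads a file larger than 2 GB.",
)
print(result.score)
```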

Possible

Evaluate if a solution is feasible and practical.

Security

Evaluate if a solution has security vulnerabilities.

Sql

Check whether two SQL queries are equivalent.

Summary

Evaluate text summarization quality.

Translation

Evaluate translation quality.

String Evaluators

EmbeddingSimilarity

String similarity scorer using embeddings.
Parameters:
  prefix: Any
  model: Any
  expected_min: Any
  api_key: Any
  base_url: Any
  client: Optional[LLMClient]
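
A minimal sketch (assumes OPENAI_API_KEY is set; the prefix and example strings are illustrative):

```python
from autoevals import EmbeddingSimilarity

# prefix adds shared context so that short strings embed more meaningfully
evaluator = EmbeddingSimilarity(prefix="The capital of France is ")

result = evaluator(output="Paris", expected="Paris, the capital city of France")
print(result.score)  # embedding-similarity score in [0, 1]
```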

ExactMatch

A scorer that tests for exact equality between values.

Levenshtein

String similarity scorer using edit distance.
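
Both ExactMatch and Levenshtein follow the same call convention as the other scorers; a minimal sketch:

```python
from autoevals import ExactMatch, Levenshtein

print(ExactMatch()(output="Paris", expected="Paris").score)    # 1.0 for identical values
print(Levenshtein()(output="Pariss", expected="Paris").score)  # below 1.0, penalized per edit
```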

Numeric Evaluators

NumericDiff

Numeric similarity scorer using normalized difference.
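
A minimal sketch (the example values are illustrative):

```python
from autoevals import NumericDiff

# Identical numbers score 1.0; the score falls as the normalized difference grows
result = NumericDiff()(output=104, expected=100)
print(result.score)
```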

JSON Evaluators

JSONDiff

Compare JSON objects for structural and content similarity.
Parameters:
  string_scorer: Scorer
  number_scorer: Scorer
  preserve_strings: bool
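
A minimal sketch comparing two objects (the example objects are illustrative; string and number leaves are compared with the scorers listed above):

```python
from autoevals import JSONDiff

result = JSONDiff()(
    output={"name": "Ada Lovelace", "age": 36},
    expected={"name": "Ada Lovelace", "age": 37},
)
print(result.score)  # high but below 1.0 because of the age mismatch
```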

ValidJSON

Validate if a string is valid JSON and optionally matches a schema.
Parameters:
  schema: Any
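
A minimal sketch; the call shape mirrors the other scorers (only the output is scored) and the JSON Schema shown here is illustrative:

```python
from autoevals import ValidJSON

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}},
    "required": ["name"],
}

print(ValidJSON(schema=schema)(output='{"name": "Ada"}').score)  # 1 for valid, schema-conforming JSON
print(ValidJSON()(output='{"name": ').score)                     # 0 for malformed JSON
```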

List Evaluators

ListContains

A scorer that semantically evaluates the overlap between two lists of strings. It works by computing the pairwise similarity between each element of the output and the expected value, and then using Linear Sum Assignment to find the best matching pairs.
Parameters:
  pairwise_scorer: Any
  allow_extra_entities: Any
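
A minimal sketch; passing Levenshtein as the pairwise scorer here is an illustrative choice, not the default:

```python
from autoevals import ListContains, Levenshtein

scorer = ListContains(pairwise_scorer=Levenshtein(), allow_extra_entities=True)

result = scorer(
    output=["apple", "banana", "cherry"],
    expected=["banana", "apple"],
)
print(result.score)
```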

RAGAS Evaluators

AnswerCorrectness

Evaluates how correct the generated answer is compared to the expected answer.
Parameters:
  pairwise_scorer: Any
  model: Any
  factuality_weight: Any
  answer_similarity_weight: Any
  answer_similarity: Any
  client: Optional[Client]

AnswerRelevancy

Evaluates how relevant the generated answer is to the input question.
Parameters:
  model: Any
  strictness: Any
  temperature: Any
  embedding_model: Any
  client: Optional[Client]

AnswerSimilarity

Evaluates how semantically similar the generated answer is to the expected answer.
Parameters:
  pairwise_scorer: Any
  model: Any
  client: Optional[Client]

ContextEntityRecall

Measures how well the context contains the entities mentioned in the expected answer.
Parameters:
  pairwise_scorer: Any
  model: Any
  client: Optional[Client]

ContextPrecision

Measures how precise and focused the context is for answering the question.
Parameters:
  pairwise_scorer: Any
  model: Any
  client: Optional[Client]

ContextRecall

Measures how well the context supports the expected answer.
Parameters:
  pairwise_scorer: Any
  model: Any
  client: Optional[Client]

ContextRelevancy

Evaluates how relevant the context is to the input question.
Parameters:
  pairwise_scorer: Any
  model: Any
  client: Optional[Client]

Faithfulness

Evaluates if the generated answer is faithful to the given context.
Parameters:
  model: Any
  client: Optional[Client]
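
A minimal sketch of a RAG-style evaluation (assumes OPENAI_API_KEY is set; the example data is illustrative, and the `input` and `context` keyword names follow the common scorer calling convention but should be verified against the reference docs):

```python
from autoevals.ragas import ContextRecall, Faithfulness

question = "Where was Marie Curie born?"
context = "Marie Curie was born in Warsaw in 1867 and later moved to Paris."
answer = "Marie Curie was born in Warsaw."

# Faithfulness: is the generated answer grounded in the retrieved context?
print(Faithfulness()(output=answer, input=question, context=context).score)

# ContextRecall: does the context support the expected answer?
print(
    ContextRecall()(
        output=answer,
        expected="Marie Curie was born in Warsaw.",
        input=question,
        context=context,
    ).score
)
```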

Moderation

Moderation

A scorer that evaluates if AI responses contain inappropriate or unsafe content.
Parameters:
  threshold: Any
  api_key: Any
  base_url: Any
  client: Optional[Client]
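
A minimal sketch (assumes OPENAI_API_KEY is set, since the scorer calls a moderation endpoint; the threshold value and example text are illustrative):

```python
from autoevals import Moderation

# A score of 1 means the text passed moderation; 0 means it was flagged
result = Moderation(threshold=0.5)(output="I want to adopt a puppy.")
print(result.score)
```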

Other

LLMClient

A client wrapper for LLM operations that supports both OpenAI SDK v0 and v1.

Source Code

For the complete Python source code and additional examples, visit the autoevals GitHub repository.