AutoEvals is a tool to quickly and easily evaluate AI model outputs.


pip install autoevals


from autoevals.llm import Factuality

# Create a new LLM-based evaluator
evaluator = Factuality()

# Evaluate an example LLM completion
# (named `question` rather than `input` to avoid shadowing the Python builtin)
question = "Which country has the highest population?"
output = "People's Republic of China"
expected = "China"
result = evaluator(output, expected, input=question)

# The evaluator returns a score in the range [0, 1], plus metadata holding its raw output
print(f"Factuality score: {result.score}")
print(f"Factuality rationale: {result.metadata['rationale']}")


LLMClassifier Objects

class LLMClassifier(OpenAILLMClassifier)

An LLM-based classifier that wraps OpenAILLMClassifier and provides a standard way to apply chain of thought, parse the output, and score the result.

Battle Objects

class Battle(SpecFileClassifier)

Test whether an output follows the instructions better than the original (expected) value.

ClosedQA Objects

class ClosedQA(SpecFileClassifier)

Test whether an output answers the input using knowledge built into the model. You can specify criteria to further constrain the answer.

Humor Objects

class Humor(SpecFileClassifier)

Test whether an output is funny.

Factuality Objects

class Factuality(SpecFileClassifier)

Test whether an output is factual, compared to an original (expected) value.

Possible Objects

class Possible(SpecFileClassifier)

Test whether an output is a possible solution to the challenge posed in the input.

Security Objects

class Security(SpecFileClassifier)

Test whether an output is malicious.

Sql Objects

class Sql(SpecFileClassifier)

Test whether a SQL query (output) is semantically the same as a reference (expected) query.

Summary Objects

class Summary(SpecFileClassifier)

Test whether an output is a better summary of the input than the original (expected) value.

Translation Objects

class Translation(SpecFileClassifier)

Test whether an output is as good a translation of the input into the specified language as an expert (expected) value.


Levenshtein Objects

class Levenshtein(Scorer)

A simple scorer that uses the Levenshtein distance to compare two strings.
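
To illustrate how an edit distance can be turned into a score in [0, 1], here is a minimal plain-Python sketch. The function names and the normalization by the longer string's length are assumptions made for illustration; the library's own implementation may differ.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn `a` into `b`."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def levenshtein_score(output: str, expected: str) -> float:
    """Hypothetical helper: 1.0 for identical strings, 0.0 for maximally different ones."""
    longest = max(len(output), len(expected))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein_distance(output, expected) / longest
```

Under this normalization, a single-character typo in a long string costs very little, while two short strings with no characters in common score 0.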



EmbeddingSimilarity Objects

class EmbeddingSimilarity(Scorer)

A simple scorer that uses cosine similarity to compare two strings.

def __init__(prefix="",

Create a new EmbeddingSimilarity scorer.


  • prefix: A prefix to prepend to the prompt. This is useful for specifying the domain of the inputs.
  • model: The model to use for the embedding distance. Defaults to "text-embedding-ada-002".
  • expected_min: The minimum expected score. Defaults to 0.7. Values below this will be scored as 0, and values between this and 1 will be scaled linearly.
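
The effect of expected_min described above can be sketched in plain Python. The helper names below (cosine_similarity, scale_score) are hypothetical and are not taken from the library; the sketch assumes the embeddings are already available as lists of floats and only shows the similarity and rescaling steps.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def scale_score(similarity: float, expected_min: float = 0.7) -> float:
    """Map similarities at or below expected_min to 0, and stretch
    the range (expected_min, 1] linearly onto (0, 1]."""
    if similarity <= expected_min:
        return 0.0
    return min((similarity - expected_min) / (1.0 - expected_min), 1.0)
```

With the default expected_min of 0.7, a raw similarity of 0.85 lands halfway up the scaled range, so it would be reported as roughly 0.5.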


NumericDiff Objects

class NumericDiff(Scorer)

A simple scorer that compares numbers by normalizing their difference.
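
One plausible way to normalize a numeric difference into a score in [0, 1] is to divide the absolute difference by the sum of the two magnitudes. The sketch below assumes that formula for illustration only; numeric_diff_score is a hypothetical name, and the library may normalize differently.

```python
def numeric_diff_score(output: float, expected: float) -> float:
    """Return 1.0 for equal numbers, approaching 0.0 as they diverge."""
    if output == expected:
        return 1.0  # also covers the 0 vs 0 case, avoiding division by zero
    # By the triangle inequality the ratio is at most 1, so the score stays in [0, 1].
    return 1.0 - abs(expected - output) / (abs(expected) + abs(output))
```

With this normalization the score depends on the relative gap, not the absolute one: 9 vs 10 and 900 vs 1000 receive the same score.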


JSONDiff Objects

class JSONDiff(Scorer)

A simple scorer that compares JSON objects, using a customizable comparison method for strings (defaults to Levenshtein) and numbers (defaults to NumericDiff).
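
The idea of a recursive comparison with pluggable string and number scorers can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: json_diff and its parameters are hypothetical names, the string default uses difflib's ratio as a stand-in for a Levenshtein-based scorer, and the averaging rules for objects and arrays are assumptions.

```python
from difflib import SequenceMatcher
from typing import Any, Callable

Scorer = Callable[[Any, Any], float]


def string_similarity(a: str, b: str) -> float:
    # Stand-in for a Levenshtein-based scorer (difflib ratio, not edit distance).
    return SequenceMatcher(None, a, b).ratio()


def number_similarity(a: float, b: float) -> float:
    # Stand-in for a NumericDiff-style scorer: normalized absolute difference.
    if a == b:
        return 1.0
    return 1.0 - abs(a - b) / (abs(a) + abs(b))


def _is_number(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def json_diff(output: Any, expected: Any,
              string_scorer: Scorer = string_similarity,
              number_scorer: Scorer = number_similarity) -> float:
    """Recursively compare two parsed JSON values; returns a score in [0, 1]."""
    def recurse(o: Any, e: Any) -> float:
        return json_diff(o, e, string_scorer, number_scorer)

    if isinstance(output, dict) and isinstance(expected, dict):
        keys = set(output) | set(expected)
        if not keys:
            return 1.0
        # Assumption: a key missing on either side contributes 0 for that key.
        total = sum(
            recurse(output[k], expected[k]) if k in output and k in expected else 0.0
            for k in keys
        )
        return total / len(keys)
    if isinstance(output, list) and isinstance(expected, list):
        longest = max(len(output), len(expected))
        if longest == 0:
            return 1.0
        # zip stops at the shorter list, so unmatched items contribute 0.
        return sum(recurse(o, e) for o, e in zip(output, expected)) / longest
    if isinstance(output, str) and isinstance(expected, str):
        return string_scorer(output, expected)
    if _is_number(output) and _is_number(expected):
        return number_scorer(output, expected)
    # Booleans, nulls, and mismatched types: exact match only.
    return 1.0 if type(output) is type(expected) and output == expected else 0.0
```

Passing a different callable as string_scorer or number_scorer swaps the leaf comparison without touching the traversal, for example json_diff(a, b, string_scorer=lambda x, y: float(x == y)) for exact string matching.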

ValidJSON Objects

class ValidJSON(Scorer)

A binary scorer that evaluates the validity of JSON output, optionally validating against a JSON Schema definition (see