AutoEvals is a tool for quickly and easily evaluating AI model outputs.
Installation
LLM Evaluators
Battle
Compare whether a solution performs better than a reference solution.
ClosedQA
Evaluate answer correctness using the model’s knowledge.
Factuality
Check factual accuracy against a reference (see the usage sketch after this list).
Humor
Rate the humor level in text.
LLMClassifier
High-level classifier for evaluating text using LLMs.
Possible
Evaluate if a solution is feasible and practical.
Security
Evaluate if a solution has security vulnerabilities.
Sql
Compare whether two SQL queries are equivalent.
Summary
Evaluate text summarization quality.
Translation
Evaluate translation quality.
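The prebuilt LLM evaluators share a common calling convention: instantiate the scorer, then call it with the model output, the expected value, and (for most scorers) the original input. Below is a minimal sketch using Factuality and a custom LLMClassifier. It assumes an OPENAI_API_KEY is set in the environment; the evaluator name, prompt template, and choice scores are illustrative values, not part of the library.

```python
from autoevals.llm import Factuality, LLMClassifier

# Prebuilt evaluator: grades whether `output` is factually consistent
# with `expected`, given the original `input` question.
factuality = Factuality()
result = factuality(
    output="People's Republic of China",
    expected="China",
    input="Which country has the highest population?",
)
print(result.score)     # a score between 0 and 1
print(result.metadata)  # includes the grader's rationale

# Custom classifier built on the same machinery. The template variables
# ({{input}}, {{output}}, {{expected}}) and the choice scores below are
# hypothetical values chosen for illustration.
title_quality = LLMClassifier(
    name="TitleQuality",
    prompt_template=(
        "Issue description: {{input}}\n\n"
        "1: {{output}}\n"
        "2: {{expected}}\n\n"
        "Which title better describes the issue? Answer 1 or 2."
    ),
    choice_scores={"1": 1, "2": 0},
    use_cot=True,
)
result = title_quality(
    output="Fix flaky login test",
    expected="Login test fails intermittently on CI",
    input="The login integration test fails roughly once in five CI runs.",
)
print(result.score)
```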
String Evaluators
EmbeddingSimilarity
String similarity scorer using embeddings.
ExactMatch
A scorer that tests for exact equality between values.
Levenshtein
String similarity scorer using edit distance.
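A sketch of the string scorers, assuming the top-level autoevals package re-exports them (exact import paths may vary by version). EmbeddingSimilarity calls an embedding model, so it also assumes an OPENAI_API_KEY in the environment.

```python
from autoevals import EmbeddingSimilarity, ExactMatch, Levenshtein

# Edit-distance similarity: 1.0 for identical strings, lower as edits accumulate.
print(Levenshtein()(output="kitten", expected="sitting").score)

# Strict equality: 1 if output and expected match exactly, otherwise 0.
print(ExactMatch()(output="42", expected="42").score)

# Semantic similarity via embeddings (requires an embedding model / API key).
print(EmbeddingSimilarity()(
    output="The sky is blue today.",
    expected="Today the sky appears blue.",
).score)
```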
Numeric Evaluators
NumericDiff
Numeric similarity scorer using normalized difference.
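A minimal sketch of NumericDiff, again assuming the top-level import.

```python
from autoevals import NumericDiff

# Normalized difference: identical numbers score 1.0, and the score
# decreases as the gap between output and expected grows.
print(NumericDiff()(output=104, expected=100).score)
print(NumericDiff()(output=100, expected=100).score)  # 1.0
```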
JSON Evaluators
JSONDiff
Compare JSON objects for structural and content similarity.
ValidJSON
Validate if a string is valid JSON and optionally matches a schema.
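A sketch of the JSON scorers. The JSON Schema below is illustrative, and the `schema` constructor argument is assumed from the description above.

```python
from autoevals import JSONDiff, ValidJSON

# Structural and content comparison of two JSON-like objects.
diff = JSONDiff()(
    output={"name": "Ada", "langs": ["python", "ocaml"]},
    expected={"name": "Ada", "langs": ["python"]},
)
print(diff.score)

# Validity check; passing a JSON Schema (assumed `schema` argument) also
# verifies that the parsed value conforms to it.
schema = {"type": "object", "properties": {"name": {"type": "string"}}}
print(ValidJSON(schema=schema)(output='{"name": "Ada"}').score)
```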
List Evaluators
ListContains
A scorer that semantically evaluates the overlap between two lists of strings. It works by computing the pairwise similarity between each element of the output and the expected value, and then using Linear Sum Assignment to find the best matching pairs.
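A sketch of ListContains. Depending on the pairwise scorer it is configured with, it may call an embedding model, in which case an API key is needed.

```python
from autoevals import ListContains

# Pairwise similarity between output and expected items, matched with
# linear sum assignment; extra or missing items lower the score.
result = ListContains()(
    output=["apples", "bananas"],
    expected=["bananas", "apples", "pears"],
)
print(result.score)
```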
RAGAS Evaluators
AnswerCorrectness
Evaluates how correct the generated answer is compared to the expected answer.
AnswerRelevancy
Evaluates how relevant the generated answer is to the input question.
AnswerSimilarity
Evaluates how semantically similar the generated answer is to the expected answer.
ContextEntityRecall
Measures how well the context contains the entities mentioned in the expected answer.
ContextPrecision
Measures how precise and focused the context is for answering the question.
ContextRecall
Measures how well the context supports the expected answer.
ContextRelevancy
Evaluates how relevant the context is to the input question.
Faithfulness
Evaluates if the generated answer is faithful to the given context.
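The RAGAS scorers take the retrieved context alongside the usual input/output/expected fields. A minimal sketch with ContextRecall and Faithfulness, assuming the context is passed as a list of passages and an OPENAI_API_KEY is available; the question, answer, and passage text are made up for illustration.

```python
from autoevals.ragas import ContextRecall, Faithfulness

question = "Where is the Eiffel Tower located?"
context = ["The Eiffel Tower is a wrought-iron lattice tower in Paris, France."]

# Does the retrieved context support the expected answer?
print(ContextRecall()(
    input=question,
    output="Paris",
    expected="Paris, France",
    context=context,
).score)

# Is the generated answer grounded in the provided context?
print(Faithfulness()(
    input=question,
    output="The Eiffel Tower is located in Paris.",
    context=context,
).score)
```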
Moderation
Moderation
A scorer that evaluates if AI responses contain inappropriate or unsafe content.
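A minimal sketch of the Moderation scorer. It calls a moderation endpoint, so an OPENAI_API_KEY is assumed; the score interpretation in the comment is an assumption (higher meaning the content passed moderation).

```python
from autoevals import Moderation

# Flags inappropriate or unsafe content via a moderation model.
result = Moderation()(output="I want to hug my friend.")
print(result.score)     # assumed: 1 when the content passes, 0 when flagged
print(result.metadata)  # per-category moderation details
```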
Other
LLMClient
A client wrapper for LLM operations that supports both OpenAI SDK v0 and v1.
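The LLM-based scorers read credentials from the environment by default; LLMClient is the wrapper they use underneath. The sketch below assumes the init(client=...) hook described in the project README for supplying your own OpenAI client; the exact entry point and the argument forms it accepts may differ by version.

```python
import openai

from autoevals import init
from autoevals.llm import Factuality

# Supply a specific OpenAI client (custom base_url, key, etc.) instead of
# relying on the default environment-based configuration. Wrapping the raw
# client into an LLMClient is assumed to happen inside init().
init(client=openai.OpenAI())

print(Factuality()(
    output="China",
    expected="China",
    input="Which country has the highest population?",
).score)
```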
Source Code
For the complete Python source code and additional examples, visit the autoevals GitHub repository.