Autoevals TypeScript API

AutoEvals is a tool to quickly and easily evaluate AI model outputs.

Installation

npm install autoevals

RAGAS Evaluators

AnswerCorrectness

Measures answer correctness compared to ground truth using a weighted average of factuality and semantic similarity.

args

ScorerArgs
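The weighted average described above can be sketched as follows. This is a minimal illustration, not the autoevals implementation: the helper name `weightedAnswerScore` and the default weights are assumptions, and the two component scores are taken as already computed.

```typescript
// Hypothetical helper: combine two component scores (each in [0, 1])
// into a single score with a weighted average.
// The default weights are illustrative assumptions, not autoevals values.
function weightedAnswerScore(
  factuality: number,
  similarity: number,
  factualityWeight = 0.75,
  similarityWeight = 0.25,
): number {
  const totalWeight = factualityWeight + similarityWeight;
  if (totalWeight <= 0) {
    throw new Error("weights must sum to a positive number");
  }
  // Dividing by the total lets callers pass weights that do not sum to 1.
  return (
    (factuality * factualityWeight + similarity * similarityWeight) /
    totalWeight
  );
}
```

Raising the factuality weight makes the combined score more sensitive to factual errors than to wording differences.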

AnswerRelevancy

Scores the relevancy of the generated answer to the given question. Answers with incomplete, redundant or unnecessary information are penalized.

args

ScorerArgs

AnswerSimilarity

Scores the semantic similarity between the generated answer and ground truth.

args

ScorerArgs

ContextEntityRecall

Estimates context recall by counting true positives (TP) and false negatives (FN) between the annotated answer and the retrieved context.

args

ScorerArgs
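The recall arithmetic can be sketched as below, assuming the entities have already been extracted from the annotated answer and from the retrieved context. The helper name `entityRecall`, the case-insensitive matching, and the zero score for an empty entity list are assumptions made for illustration.

```typescript
// Hypothetical helper: recall = TP / (TP + FN), where
//   TP = expected entities that also appear in the retrieved context
//   FN = expected entities that are missing from the retrieved context
function entityRecall(
  expectedEntities: string[],
  contextEntities: string[],
): number {
  // Deduplicate and compare case-insensitively (an assumption of this sketch).
  const expected = Array.from(
    new Set(expectedEntities.map((e) => e.toLowerCase())),
  );
  const context = new Set(contextEntities.map((e) => e.toLowerCase()));
  if (expected.length === 0) return 0; // assumption: nothing to recall scores 0

  const truePositives = expected.filter((e) => context.has(e)).length;
  const falseNegatives = expected.length - truePositives;
  return truePositives / (truePositives + falseNegatives);
}
```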

ContextPrecision

ContextPrecision evaluator function.

args

ScorerArgs

ContextRecall

ContextRecall evaluator function.

args

ScorerArgs

ContextRelevancy

ContextRelevancy evaluator function.

args

ScorerArgs

Faithfulness

Measures factual consistency of the generated answer with the given context.

args

ScorerArgs

LLM Evaluators

Battle

Test whether an output performs the instructions better than the original (expected) value.

args

ScorerArgs

ClosedQA

Test whether an output answers the input using knowledge built into the model. You can specify criteria to further constrain the answer.

args

ScorerArgs

Factuality

Test whether an output is factual, compared to an original (expected) value.

args

ScorerArgs

Humor

Test whether an output is funny.

args

ScorerArgs

Possible

Test whether an output is a possible solution to the challenge posed in the input.

args

ScorerArgs

Security

Test whether an output is malicious.

args

ScorerArgs

Sql

Test whether a SQL query is semantically the same as a reference (output) query.

args

ScorerArgs

Summary

Test whether an output is a better summary of the input than the original (expected) value.

args

ScorerArgs

Translation

Test whether an output is as good a translation of the input into the specified language as an expert (expected) value.

args

ScorerArgs

String Evaluators

EmbeddingSimilarity

A scorer that uses cosine similarity to compare two strings.

args

ScorerArgs
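Cosine similarity itself is a small computation once each string has been turned into an embedding vector. The sketch below shows only that step; the function name `cosineSimilarity` is illustrative, and obtaining the embeddings (which the scorer does through an embedding model) is out of scope here.

```typescript
// Cosine similarity between two equal-length vectors:
//   dot(a, b) / (|a| * |b|)
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("vectors must be non-empty and of equal length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction; treat it as dissimilar.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```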

ExactMatch

A simple scorer that tests whether two values are equal. If the value is an object or array, it will be JSON-serialized and the strings compared for equality.

args

reflection
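The serialize-then-compare behavior can be sketched as follows. The names `serializeForComparison` and `exactMatchScore` are hypothetical, and the handling of non-object values (plain `String` conversion) is an assumption of this sketch.

```typescript
// Objects and arrays are JSON-serialized; other values are stringified.
function serializeForComparison(value: unknown): string {
  if (typeof value === "object" && value !== null) {
    // Note: JSON.stringify is sensitive to key order, so {a, b} and {b, a}
    // serialize differently.
    return JSON.stringify(value);
  }
  return String(value);
}

// Returns 1 when the serialized forms are identical, 0 otherwise.
function exactMatchScore(output: unknown, expected: unknown): number {
  return serializeForComparison(output) === serializeForComparison(expected)
    ? 1
    : 0;
}
```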

Levenshtein

A simple scorer that uses the Levenshtein distance to compare two strings.

args

reflection
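One common way to turn an edit distance into a score between 0 and 1 is to divide by the length of the longer string. The sketch below takes that approach; the function names and the normalization are assumptions, not a statement of how the library computes its score.

```typescript
// Classic dynamic-programming edit distance, keeping one row at a time.
function levenshteinDistance(a: string, b: string): number {
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1, // deletion
        curr[j - 1] + 1, // insertion
        prev[j - 1] + cost, // substitution
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Assumed normalization: 1 means identical, 0 means nothing in common.
function levenshteinScore(output: string, expected: string): number {
  const maxLen = Math.max(output.length, expected.length);
  if (maxLen === 0) return 1; // two empty strings are identical
  return 1 - levenshteinDistance(output, expected) / maxLen;
}
```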

LevenshteinScorer

LevenshteinScorer evaluator function.

args

reflection

JSON Evaluators

JSONDiff

Compare JSON objects for structural and content similarity. This scorer recursively compares JSON objects, handling:

Nested dictionaries and arrays
String similarity using Levenshtein distance (or custom scorer)
Numeric value comparison (or custom scorer)
Automatic parsing of JSON strings

args

ScorerArgs
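The recursive comparison listed above can be sketched roughly as follows. This is a simplified illustration under stated assumptions: strings are compared by plain equality rather than Levenshtein distance, numbers use a normalized difference, and missing keys or array elements score 0. The names `Json`, `parseIfJsonString`, and `jsonSimilarity` are hypothetical.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Parse strings that contain a JSON object or array; leave others untouched.
function parseIfJsonString(value: Json): Json {
  if (typeof value !== "string") return value;
  try {
    const parsed = JSON.parse(value) as Json;
    return typeof parsed === "object" && parsed !== null ? parsed : value;
  } catch {
    return value;
  }
}

function isJsonObject(value: Json): value is { [key: string]: Json } {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Returns a similarity in [0, 1], averaging over array elements or object keys.
function jsonSimilarity(a: Json, b: Json): number {
  const left = parseIfJsonString(a);
  const right = parseIfJsonString(b);

  if (Array.isArray(left) && Array.isArray(right)) {
    const n = Math.max(left.length, right.length);
    if (n === 0) return 1;
    let total = 0;
    for (let i = 0; i < n; i++) {
      // Positions present on only one side contribute 0.
      if (i < left.length && i < right.length) {
        total += jsonSimilarity(left[i], right[i]);
      }
    }
    return total / n;
  }

  if (isJsonObject(left) && isJsonObject(right)) {
    const keys = Array.from(
      new Set(Object.keys(left).concat(Object.keys(right))),
    );
    if (keys.length === 0) return 1;
    let total = 0;
    for (const key of keys) {
      if (key in left && key in right) {
        total += jsonSimilarity(left[key], right[key]);
      }
    }
    return total / keys.length;
  }

  if (typeof left === "number" && typeof right === "number") {
    const denom = Math.abs(left) + Math.abs(right);
    return denom === 0 ? 1 : 1 - Math.abs(left - right) / denom;
  }

  // Simplification: strings, booleans, and null are compared by equality.
  return left === right ? 1 : 0;
}
```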

ValidJSON

Validate whether a value is valid JSON and, optionally, whether it matches a JSON Schema. This scorer checks that:

The input can be parsed as valid JSON (if it’s a string)
The parsed JSON matches the JSON Schema, if one is provided

It handles both string inputs and pre-parsed JSON objects.

args

ScorerArgs
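The checks above can be sketched as follows. To keep the example dependency-free, a hypothetical `Validator` predicate stands in for JSON Schema validation. Treating only objects and arrays as valid results is also an assumption of this sketch.

```typescript
// Hypothetical stand-in for a JSON Schema check.
type Validator = (value: unknown) => boolean;

// Returns 1 if the value is (or parses to) a JSON object or array and
// passes the optional validator, otherwise 0.
function validJsonScore(output: unknown, validate?: Validator): number {
  let parsed: unknown = output;
  if (typeof output === "string") {
    try {
      parsed = JSON.parse(output);
    } catch {
      return 0; // not parseable as JSON
    }
  }
  // Assumption: bare scalars such as 42 or "text" do not count.
  if (typeof parsed !== "object" || parsed === null) return 0;
  if (validate && !validate(parsed)) return 0;
  return 1;
}
```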

Custom Evaluators

LLMClassifierFromSpec

LLMClassifierFromSpec evaluator function.

name

string

spec

reflection

LLMClassifierFromSpecFile

LLMClassifierFromSpecFile evaluator function.

name

string

templateName

LLMClassifierFromTemplate

LLMClassifierFromTemplate evaluator function.

__namedParameters

reflection

OpenAIClassifier

OpenAIClassifier evaluator function.

args

ScorerArgs

buildClassificationTools

buildClassificationTools evaluator function.

useCoT

boolean

choiceStrings

array

List Evaluators

ListContains

A scorer that semantically evaluates the overlap between two lists of strings. It works by computing the pairwise similarity between each element of the output and the expected value, and then using Linear Sum Assignment to find the best matching pairs.

args

ScorerArgs
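The matching step can be illustrated with a brute-force search over one-to-one pairings, which gives the same optimum as Linear Sum Assignment for small lists but scales exponentially. This sketch is not the library's algorithm. The function names, the exact-match default similarity, and the choice to divide by the longer list's length are all assumptions.

```typescript
type Similarity = (a: string, b: string) => number;

// Hypothetical default: 1 for identical strings, 0 otherwise.
const exactSimilarity: Similarity = (a, b) => (a === b ? 1 : 0);

// Best total similarity over all one-to-one pairings (brute force).
// Assumes the similarity function is symmetric and non-negative.
function bestAssignmentTotal(
  output: string[],
  expected: string[],
  similarity: Similarity,
): number {
  // Iterate over the shorter list so every row can be paired.
  const rows = output.length <= expected.length ? output : expected;
  const cols = output.length <= expected.length ? expected : output;
  const used: boolean[] = cols.map(() => false);

  const search = (row: number): number => {
    if (row === rows.length) return 0;
    let best = 0;
    for (let c = 0; c < cols.length; c++) {
      if (used[c]) continue;
      used[c] = true;
      best = Math.max(best, similarity(rows[row], cols[c]) + search(row + 1));
      used[c] = false;
    }
    return best;
  };
  return search(0);
}

// Assumed normalization: unmatched items on either side lower the score.
function listOverlapScore(
  output: string[],
  expected: string[],
  similarity: Similarity = exactSimilarity,
): number {
  const denominator = Math.max(output.length, expected.length);
  if (denominator === 0) return 1;
  return bestAssignmentTotal(output, expected, similarity) / denominator;
}
```

Swapping in a fuzzy similarity function would reward near matches instead of only identical strings.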

Moderation

A scorer that uses OpenAI’s moderation API to determine whether an AI response contains any flagged content.

args

ScorerArgs

Numeric Evaluators

NumericDiff

A simple scorer that compares numbers by normalizing their difference.

args

reflection
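One plausible way to normalize a numeric difference into a score between 0 and 1 is to divide the absolute difference by the sum of the magnitudes. The sketch below uses that formula as an assumption; the function name `numericDiffScore` is illustrative.

```typescript
// Assumed normalization:
//   score = 1 - |expected - output| / (|expected| + |output|)
// Identical values score 1; values of opposite sign score 0.
function numericDiffScore(output: number, expected: number): number {
  if (output === 0 && expected === 0) return 1; // avoid dividing by zero
  return (
    1 - Math.abs(expected - output) / (Math.abs(expected) + Math.abs(output))
  );
}
```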

Other

computeThreadTemplateVars

Compute template variables from a thread for use in mustache templates. Uses lazy getters so expensive computations only run when accessed. Note: thread (and other message variables) will automatically render as human-readable text when used in templates like {{thread}} due to the smart escape function in renderMessages.

thread

array
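The lazy-getter pattern mentioned above can be sketched as follows. The helper `defineLazy`, the function `buildThreadVars`, and the variable names `last_message` and `message_count` are hypothetical. This reference does not list the variables the library actually exposes.

```typescript
type ThreadMessage = { role: string; content: string };

// Define a property whose value is computed on first access and then cached.
function defineLazy<V>(target: object, key: string, compute: () => V): void {
  let cached: { value: V } | undefined;
  Object.defineProperty(target, key, {
    enumerable: true,
    get: () => {
      if (!cached) cached = { value: compute() };
      return cached.value;
    },
  });
}

// Hypothetical variable set, for illustration only.
function buildThreadVars(thread: ThreadMessage[]): Record<string, unknown> {
  const vars: Record<string, unknown> = {};
  defineLazy(vars, "thread", () => thread);
  defineLazy(vars, "last_message", () => thread[thread.length - 1]);
  defineLazy(vars, "message_count", () => thread.length);
  return vars;
}
```

Because each getter runs only when a template reads the variable, unused variables cost nothing.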

formatMessageArrayAsText

Format an array of LLM messages as human-readable text.

messages

array
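A minimal sketch of this kind of formatting is shown below. The output layout (one `role: content` line per message) and the name `formatMessagesAsText` are assumptions. The library's actual text format may differ.

```typescript
type ChatMessage = { role: string; content: string };

// Assumed layout: one "role: content" line per message.
function formatMessagesAsText(messages: ChatMessage[]): string {
  return messages.map((m) => `${m.role}: ${m.content}`).join("\n");
}
```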

getDefaultModel

Get the configured default completion model, or “gpt-4o” if not set.

isLLMMessageArray

Check if a value is an array of LLM messages.

value

unknown

isRoleContentMessage

Check if an item looks like an LLM message (has role and content).

item

unknown
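Checks like isLLMMessageArray and isRoleContentMessage are naturally written as TypeScript type guards. The sketch below uses hypothetical names (`looksLikeMessage`, `looksLikeMessageArray`). Requiring a string role and rejecting empty arrays are assumptions of this sketch.

```typescript
type RoleContentMessage = { role: string; content: unknown };

// True when the item is an object with a string `role` and a `content` field.
function looksLikeMessage(item: unknown): item is RoleContentMessage {
  return (
    typeof item === "object" &&
    item !== null &&
    "role" in item &&
    "content" in item &&
    typeof (item as { role: unknown }).role === "string"
  );
}

// True when the value is a non-empty array whose items all look like messages.
// Assumption: an empty array is not treated as a message array.
function looksLikeMessageArray(value: unknown): value is RoleContentMessage[] {
  return (
    Array.isArray(value) &&
    value.length > 0 &&
    value.every((item) => looksLikeMessage(item))
  );
}
```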

templateUsesThreadVariables

Check if a template string might use thread-related template variables. This is a heuristic: it looks for variable names after {{ or {% syntax.

template

string
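A heuristic of this kind can be sketched with a regular expression. The variable names in `THREAD_VARIABLE_NAMES` and the function name `mightUseThreadVariables` are hypothetical, and this sketch only matches a name that appears directly after the opening delimiter (allowing whitespace and mustache sigils).

```typescript
// Hypothetical list of thread-related variable names.
const THREAD_VARIABLE_NAMES = ["thread", "first_message", "last_message"];

// Returns true if a known name appears right after "{{" or "{%".
// Being a heuristic, it can give false positives and false negatives.
function mightUseThreadVariables(template: string): boolean {
  const names = THREAD_VARIABLE_NAMES.join("|");
  // "{" followed by "{" or "%", optional sigils/whitespace, then a whole name.
  const pattern = new RegExp(`\\{[{%][#^/&\\s]*(?:${names})\\b`);
  return pattern.test(template);
}
```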

Configuration

init

Initialize autoevals with a custom client and/or default models.

__namedParameters

InitOptions

Utilities

makePartial

makePartial evaluator function.

Scorer

name

string

normalizeValue

normalizeValue evaluator function.

value

unknown

maybeObject

boolean

Source Code

For the complete TypeScript source code and additional examples, visit the autoevals GitHub repository.
