AutoEvals is a tool to quickly and easily evaluate AI model outputs.

Installation

```bash
npm install autoevals
```
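
Each evaluator is exported as an async scorer function that resolves to a score object whose `score` field is a number between 0 and 1. For example, here is the standard call pattern using `Factuality` (this assumes an OpenAI API key is available as `OPENAI_API_KEY` in the environment):

```typescript
import { Factuality } from "autoevals";

// Scorers take the relevant fields as a single object argument and
// resolve to a result whose `score` is between 0 and 1.
const result = await Factuality({
  input: "Which country has the highest population?",
  output: "People's Republic of China",
  expected: "China",
});
console.log(`Factuality score: ${result.score}`);
```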

RAGAS Evaluators

AnswerCorrectness

Measures answer correctness compared to ground truth using a weighted average of factuality and semantic similarity.
Parameters:
- `answerSimilarity`: `Scorer<string, object>`
- `answerSimilarityWeight`: `number`
- `factualityWeight`: `number`
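
A minimal sketch of tuning these weights, assuming the scorer otherwise follows the standard `input`/`output`/`expected` call pattern (the example values are illustrative):

```typescript
import { AnswerCorrectness } from "autoevals";

// Weight factuality more heavily than semantic similarity.
const result = await AnswerCorrectness({
  input: "When did the Battle of Hastings take place?",
  output: "The Battle of Hastings took place in 1066.",
  expected: "1066",
  factualityWeight: 0.75,
  answerSimilarityWeight: 0.25,
});
console.log(result.score);
```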

AnswerRelevancy

Scores the relevancy of the generated answer to the given question. Answers with incomplete, redundant or unnecessary information are penalized.
Parameters:
- `strictness`: `number`

AnswerSimilarity

Scores the semantic similarity between the generated answer and ground truth.
Parameters:
- `args`: `ScorerArgs<string, RagasArgs>`

ContextEntityRecall

Estimates context recall by estimating true positives (TP) and false negatives (FN) from the annotated answer and the retrieved context.
Parameters:
- `pairwiseScorer`: `Scorer<string, object>`

Faithfulness

Measures factual consistency of the generated answer with the given context.
Parameters:
- `args`: `ScorerArgs<string, RagasArgs>`
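
A sketch of the RAGAS call pattern, assuming these scorers accept the retrieved passages via a `context` field (an assumption based on the RAGAS methodology; the example values are illustrative):

```typescript
import { Faithfulness } from "autoevals";

const result = await Faithfulness({
  input: "Where is the Eiffel Tower located?",
  output: "The Eiffel Tower is located in Paris, France.",
  // `context` is assumed here: the retrieved passages the answer
  // must stay consistent with.
  context: [
    "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France.",
  ],
});
console.log(result.score);
```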

LLM Evaluators

Battle

Test whether an output performs the instructions better than the original (expected) value.
Parameters:
- `instructions`: `string` (required)
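
For example, a sketch comparing a candidate output against the expected one under the same instructions (example values are illustrative):

```typescript
import { Battle } from "autoevals";

const result = await Battle({
  instructions: "Write a one-sentence summary of the plot of Hamlet.",
  output:
    "A Danish prince feigns madness while plotting revenge on the uncle who murdered his father.",
  expected: "Hamlet is a play about a prince in Denmark.",
});
console.log(result.score); // higher when `output` follows the instructions better than `expected`
```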

ClosedQA

Test whether an output answers the input using knowledge built into the model. You can specify criteria to further constrain the answer.
Parameters:
- `criteria`: `any` (required)
- `input`: `string` (required)
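
For example (the criteria string here is illustrative):

```typescript
import { ClosedQA } from "autoevals";

const result = await ClosedQA({
  input: "What is the capital of France?",
  criteria: "The answer must name the correct city.",
  output: "The capital of France is Paris.",
});
console.log(result.score);
```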

Factuality

Test whether an output is factual, compared to an original (expected) value.
Parameters:
- `expected`: `string`
- `input`: `string` (required)
- `output`: `string` (required)

Humor

Test whether an output is funny.
Parameters:
- `args`: `ScorerArgs<string, LLMClassifierArgs<{}>>`

Possible

Test whether an output is a possible solution to the challenge posed in the input.
Parameters:
- `input`: `string` (required)

Security

Test whether an output is malicious.
Parameters:
- `args`: `ScorerArgs<string, LLMClassifierArgs<{}>>`

Sql

Test whether a SQL query is semantically the same as a reference (expected) query.
Parameters:
- `input`: `string` (required)
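
For example, two queries that differ syntactically but return the same rows (example values are illustrative):

```typescript
import { Sql } from "autoevals";

const result = await Sql({
  input: "List the names of users older than 30.",
  output: "SELECT name FROM users WHERE age > 30",
  expected: "SELECT u.name FROM users u WHERE u.age > 30",
});
console.log(result.score); // high when the two queries are semantically equivalent
```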

Summary

Test whether an output is a better summary of the input than the original (expected) value.
Parameters:
- `input`: `string` (required)

Translation

Test whether an output translates the input into the specified language as well as an expert (expected) value.
Parameters:
- `input`: `string` (required)
- `language`: `string` (required)
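
For example (the example values are illustrative):

```typescript
import { Translation } from "autoevals";

const result = await Translation({
  input: "Good morning, how are you?",
  language: "Spanish",
  output: "Buenos días, ¿cómo estás?",
  expected: "Buenos días, ¿cómo está usted?",
});
console.log(result.score);
```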

String Evaluators

EmbeddingSimilarity

A scorer that uses cosine similarity to compare two strings.
Parameters:
- `expectedMin`: `number`
- `model`: `string`
- `prefix`: `string`
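
A minimal sketch (the embedding call assumes `OPENAI_API_KEY` is set; the optional `model` override is shown commented out):

```typescript
import { EmbeddingSimilarity } from "autoevals";

const result = await EmbeddingSimilarity({
  output: "The weather is lovely today.",
  expected: "It is a very pleasant day outside.",
  // model: "text-embedding-ada-002", // optional: choose the embedding model
});
console.log(result.score); // similarity score in [0, 1]
```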

ExactMatch

A simple scorer that tests whether two values are equal. If the value is an object or array, it will be JSON-serialized and the strings compared for equality.
Parameters:
- `args`: `Object`

Levenshtein

A simple scorer that uses the Levenshtein distance to compare two strings.
Parameters:
- `args`: `Object`
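
These heuristic scorers run locally and need no API key. For example:

```typescript
import { ExactMatch, Levenshtein } from "autoevals";

const exact = await ExactMatch({ output: "foo", expected: "foo" });
console.log(exact.score); // 1 (0 for any mismatch)

const fuzzy = await Levenshtein({ output: "kitten", expected: "sitting" });
console.log(fuzzy.score); // between 0 and 1; higher means closer strings
```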

JSON Evaluators

JSONDiff

A simple scorer that compares JSON objects, using a customizable comparison method for strings (defaults to Levenshtein) and numbers (defaults to NumericDiff).
Parameters:
- `numberScorer`: `Scorer<number, object>`
- `preserveStrings`: `boolean`
- `stringScorer`: `Scorer<string, object>`
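
For example, comparing two objects that differ only in one string field:

```typescript
import { JSONDiff } from "autoevals";

const result = await JSONDiff({
  output: { name: "Alice", age: 30 },
  expected: { name: "Alicia", age: 30 },
});
console.log(result.score); // partial credit: `age` matches exactly, `name` is close
```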

ValidJSON

A binary scorer that evaluates the validity of JSON output, optionally validating against a JSON Schema definition (see https://json-schema.org/learn/getting-started-step-by-step#create).
Parameters:
- `schema`: `any`
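
For example, validating output against a small JSON Schema:

```typescript
import { ValidJSON } from "autoevals";

const schema = {
  type: "object",
  properties: { name: { type: "string" } },
  required: ["name"],
};

const result = await ValidJSON({ output: '{"name": "Alice"}', schema });
console.log(result.score); // 1 if the output parses and conforms to the schema, else 0
```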

List Evaluators

ListContains

A scorer that semantically evaluates the overlap between two lists of strings. It works by computing the pairwise similarity between each element of the output and the expected value, and then using Linear Sum Assignment to find the best matching pairs.
Parameters:
- `allowExtraEntities`: `boolean`
- `pairwiseScorer`: `Scorer<string, {}>`
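
For example (the example values are illustrative):

```typescript
import { ListContains } from "autoevals";

const result = await ListContains({
  output: ["New York", "San Francisco"],
  expected: ["New York City", "San Francisco", "Chicago"],
  // allowExtraEntities: true, // optional: don't penalize extra output items
});
console.log(result.score);
```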

Moderation

Moderation

A scorer that uses OpenAI’s moderation API to determine whether an AI response contains any flagged content.
Parameters:
- `threshold`: `number`
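
A sketch of a call (this assumes `OPENAI_API_KEY` is set; the scoring convention noted in the comment is an assumption):

```typescript
import { Moderation } from "autoevals";

const result = await Moderation({
  output: "Have a wonderful day!",
  threshold: 0.5, // optional cutoff applied to the per-category moderation scores
});
console.log(result.score); // assumed: 1 when nothing is flagged, 0 otherwise
```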

Numeric Evaluators

NumericDiff

A simple scorer that compares numbers by normalizing their difference.
Parameters:
- `args`: `Object`
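
For example (the normalization formula in the comment is an assumption consistent with "normalizing their difference"):

```typescript
import { NumericDiff } from "autoevals";

// Assumed normalization: score ≈ 1 - |output - expected| / max(|output|, |expected|)
const result = await NumericDiff({ output: 104, expected: 100 });
console.log(result.score); // close to 1 for nearby numbers
```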

Source Code

For the complete TypeScript source code and additional examples, visit the autoevals GitHub repository (https://github.com/braintrustdata/autoevals).