AutoEvals is a tool to quickly and easily evaluate AI model outputs.

Installation

pip install autoevals

LLM Evaluators

Battle

Check whether a solution performs better than a reference solution.

ClosedQA

Evaluate answer correctness using the model’s knowledge.

Factuality

Check factual accuracy against a reference.
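
A minimal usage sketch for Factuality, adapted from the project README (assumes OPENAI_API_KEY is set in the environment):

```python
from autoevals.llm import Factuality

# Create an LLM-based evaluator (uses OpenAI by default)
evaluator = Factuality()

# Evaluate an example completion against a reference answer
input = "Which country has the highest population?"
output = "People's Republic of China"
expected = "China"

result = evaluator(output, expected, input=input)

# The returned Score has a value in [0, 1] plus metadata from the grader
print(f"Factuality score: {result.score}")
print(f"Factuality rationale: {result.metadata['rationale']}")
```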

Humor

Rate the humor level in text.

LLMClassifier

High-level classifier for evaluating text using LLMs.
Parameters:
  name: Any (required)
  prompt_template: Any (required)
  choice_scores: Any (required)
  model: Any
  use_cot: Any
  max_tokens: Any
  temperature: Any
  engine: Any
  api_key: Any
  base_url: Any
  client: Optional[Client]
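
A sketch of a custom classifier, adapted from the project README; the prompt text, choice mapping, template variables, and example data are illustrative:

```python
from autoevals import LLMClassifier

# Mustache-style template; {{input}}, {{output}}, and {{expected}} are filled in per call
prompt_template = """
You are comparing two candidate titles for a GitHub issue.

Issue description: {{input}}

1: {{output}}
2: {{expected}}

Which title better describes the issue? Answer 1 or 2.
"""

# Map each model choice to a score: 1 if the generated title wins, 0 otherwise
choice_scores = {"1": 1, "2": 0}

evaluator = LLMClassifier(
    name="TitleQuality",
    prompt_template=prompt_template,
    choice_scores=choice_scores,
    use_cot=True,
)

result = evaluator(
    output="Fix crash when uploading files larger than 2 GB",
    expected="Bug in file upload",
    input="The app crashes whenever a user uploads a file larger than 2 GB.",
)
print(result.score)
```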

Possible

Evaluate if a solution is feasible and practical.

Security

Evaluate if a solution has security vulnerabilities.

Sql

Check whether two SQL queries are equivalent.

Summary

Evaluate text summarization quality.

Translation

Evaluate translation quality.

String Evaluators

EmbeddingSimilarity

String similarity scorer using embeddings.
Parameters:
  prefix: Any
  model: Any
  expected_min: Any
  api_key: Any
  base_url: Any
  client: Optional[LLMClient]
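
A minimal sketch (assumes OPENAI_API_KEY is set; the prefix and example strings are illustrative):

```python
from autoevals import EmbeddingSimilarity

# prefix adds shared context so that short strings embed more meaningfully
evaluator = EmbeddingSimilarity(prefix="The capital of France is ")

result = evaluator(output="Paris", expected="Paris, the capital city of France")
print(result.score)  # embedding-similarity score in [0, 1]
```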

ExactMatch

A scorer that tests for exact equality between values.

Levenshtein

String similarity scorer using edit distance.
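
Both ExactMatch and Levenshtein follow the same call convention as the other scorers; a minimal sketch:

```python
from autoevals import ExactMatch, Levenshtein

print(ExactMatch()(output="Paris", expected="Paris").score)    # 1.0 for identical values
print(Levenshtein()(output="Pariss", expected="Paris").score)  # below 1.0, penalized per edit
```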

Numeric Evaluators

NumericDiff

Numeric similarity scorer using normalized difference.
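
A minimal sketch (the example values are illustrative):

```python
from autoevals import NumericDiff

# Identical numbers score 1.0; the score falls as the normalized difference grows
result = NumericDiff()(output=104, expected=100)
print(result.score)
```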

JSON Evaluators

JSONDiff

Compare JSON objects for structural and content similarity.
Parameters:
  string_scorer: Scorer
  number_scorer: Scorer
  preserve_strings: bool
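
A minimal sketch comparing two objects (the example objects are illustrative; string and number leaves are compared with the scorers listed above):

```python
from autoevals import JSONDiff

result = JSONDiff()(
    output={"name": "Ada Lovelace", "age": 36},
    expected={"name": "Ada Lovelace", "age": 37},
)
print(result.score)  # high but below 1.0 because of the age mismatch
```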

ValidJSON

Validate if a string is valid JSON and optionally matches a schema.
Parameters:
  schema: Any
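
A minimal sketch; the call shape mirrors the other scorers (only the output is scored) and the JSON Schema shown here is illustrative:

```python
from autoevals import ValidJSON

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}},
    "required": ["name"],
}

print(ValidJSON(schema=schema)(output='{"name": "Ada"}').score)  # 1 for valid, schema-conforming JSON
print(ValidJSON()(output='{"name": ').score)                     # 0 for malformed JSON
```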

List Evaluators

ListContains

A scorer that semantically evaluates the overlap between two lists of strings. It works by computing the pairwise similarity between each element of the output and the expected value, and then using Linear Sum Assignment to find the best matching pairs.
Parameters:
  pairwise_scorer: Any
  allow_extra_entities: Any
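
A minimal sketch; passing Levenshtein as the pairwise scorer here is an illustrative choice, not the default:

```python
from autoevals import ListContains, Levenshtein

scorer = ListContains(pairwise_scorer=Levenshtein(), allow_extra_entities=True)

result = scorer(
    output=["apple", "banana", "cherry"],
    expected=["banana", "apple"],
)
print(result.score)
```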

RAGAS Evaluators

AnswerCorrectness

Evaluates how correct the generated answer is compared to the expected answer.
Parameters:
  pairwise_scorer: Any
  model: Any
  factuality_weight: Any
  answer_similarity_weight: Any
  answer_similarity: Any
  client: Optional[Client]

AnswerRelevancy

Evaluates how relevant the generated answer is to the input question.
Parameters:
  model: Any
  strictness: Any
  temperature: Any
  embedding_model: Any
  client: Optional[Client]

AnswerSimilarity

Evaluates how semantically similar the generated answer is to the expected answer.
Parameters:
  pairwise_scorer: Any
  model: Any
  client: Optional[Client]

ContextEntityRecall

Measures how well the context contains the entities mentioned in the expected answer.
Parameters:
  pairwise_scorer: Any
  model: Any
  client: Optional[Client]

ContextPrecision

Measures how precise and focused the context is for answering the question.
Parameters:
  pairwise_scorer: Any
  model: Any
  client: Optional[Client]

ContextRecall

Measures how well the context supports the expected answer.
Parameters:
  pairwise_scorer: Any
  model: Any
  client: Optional[Client]

ContextRelevancy

Evaluates how relevant the context is to the input question.
Parameters:
  pairwise_scorer: Any
  model: Any
  client: Optional[Client]

Faithfulness

Evaluates if the generated answer is faithful to the given context.
Parameters:
  model: Any
  client: Optional[Client]
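
A minimal sketch of a RAG-style evaluation (assumes OPENAI_API_KEY is set; the example data is illustrative, and the `input` and `context` keyword names follow the common scorer calling convention but should be verified against the reference docs):

```python
from autoevals.ragas import ContextRecall, Faithfulness

question = "Where was Marie Curie born?"
context = "Marie Curie was born in Warsaw in 1867 and later moved to Paris."
answer = "Marie Curie was born in Warsaw."

# Faithfulness: is the generated answer grounded in the retrieved context?
print(Faithfulness()(output=answer, input=question, context=context).score)

# ContextRecall: does the context support the expected answer?
print(
    ContextRecall()(
        output=answer,
        expected="Marie Curie was born in Warsaw.",
        input=question,
        context=context,
    ).score
)
```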

Moderation

Moderation

A scorer that evaluates if AI responses contain inappropriate or unsafe content.
Parameters:
  threshold: Any
  api_key: Any
  base_url: Any
  client: Optional[Client]
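
A minimal sketch (assumes OPENAI_API_KEY is set, since the scorer calls a moderation endpoint; the threshold value and example text are illustrative):

```python
from autoevals import Moderation

# A score of 1 means the text passed moderation; 0 means it was flagged
result = Moderation(threshold=0.5)(output="I want to adopt a puppy.")
print(result.score)
```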

Other

LLMClient

A client wrapper for LLM operations that supports both OpenAI SDK v0 and v1.

Source Code

For the complete Python source code and additional examples, visit the autoevals GitHub repository.