Eval() creates a new experiment in your Braintrust project. There can be multiple eval statements in a single file.
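For example, a minimal TypeScript eval might look like the following sketch (the project name, data, and task are illustrative):

```typescript
import { Eval } from "braintrust";
import { Levenshtein } from "autoevals";

Eval("Say Hi Bot", {
  // Each record has an input and an expected output.
  data: () => [
    { input: "Foo", expected: "Hi Foo" },
    { input: "Bar", expected: "Hello Bar" },
  ],
  // The task maps an input to an output.
  task: async (input) => "Hi " + input,
  // Scorers compare the output to the expected value.
  scores: [Levenshtein],
});
```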
Data
An evaluation dataset is a list of test cases. Each has an input and optional expected output, metadata, and tags. The key fields in a data record are:
- Input: The arguments that uniquely define a test case (an arbitrary, JSON serializable object). Braintrust uses the input to know whether two test cases are the same between evaluation runs, so the cases should not contain run-specific state. A simple rule of thumb is that if you run the same eval twice, the input should be identical.
- Expected: (optional) the ground truth value (an arbitrary, JSON serializable object) that you'd compare to output to determine if your output value is correct or not. Braintrust currently does not compare output to expected for you, since there are many different ways to do that correctly. For example, you may use a subfield in expected to compare to a subfield in output for a certain scoring function. Instead, these values are just used to help you navigate your evals while debugging and comparing results.
- Metadata: (optional) a dictionary with additional data about the test example, model outputs, or just about anything else that's relevant, that you can use to help find and analyze examples later. For example, you could log the prompt, example's id, model parameters, or anything else that would be useful to slice/dice later.
- Tags: (optional) a list of strings that you can use to filter and group records later.
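As an illustration, a single data record with all four fields might look like this (the values are made up):

```typescript
const record = {
  input: { question: "Which plans cover acupuncture?" },
  expected: { answer: "Plan B and Plan C" },
  metadata: { category: "coverage", promptVersion: "v2" },
  tags: ["insurance", "smoke-test"],
};
```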
Get started
To get started with evals, you need some test data. A fine starting point is to write 5-10 examples that you believe are representative. The data must have an input field (which could be complex JSON, or just a string) and should ideally have an expected output field, although this is not required. Once you have an evaluation set up end-to-end, you can always add more test cases. You'll know you need more data if your eval scores and outputs seem fine, but your production app doesn't look right. And once you have logging set up, your real application data will provide a rich source of examples to use as test cases. As you scale, datasets are a great tool for managing your test cases.
It's a common misconception that you need a large volume of perfectly labeled
evaluation data, but that’s not the case. In practice, it’s better to assume
your data is noisy, your AI model is imperfect, and your scoring methods are a little
bit wrong. The goal of evaluation is to assess each of these components and
improve them over time.
Specify an existing dataset in evals
In addition to providing inline data examples when you call the Eval() function, you can also pass an existing or newly initialized dataset.
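For example, the sketch below passes an existing dataset instead of inline examples; the project and dataset names are illustrative:

```typescript
import { Eval, initDataset } from "braintrust";
import { Levenshtein } from "autoevals";

Eval("My Project", {
  // Load an existing dataset rather than listing test cases inline.
  data: initDataset("My Project", { dataset: "My Dataset" }),
  task: async (input) => "Hi " + input,
  scores: [Levenshtein],
});
```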
Scorers
A scoring function allows you to compare the expected output of a task to the actual output and produce a score between 0 and 1. You use a scoring function by referencing it in the scores array in your eval.
We recommend starting with the scorers provided by Braintrust's autoevals library. They work out of the box and will get you up and running quickly. Just like with test cases, once you begin running evaluations, you will find areas that need improvement. This will lead you to create your own scorers, customized to your use cases, to get a well-rounded view of your application's performance.
Define your own scorers
You can define your own scorer, for example:
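The sketch below is a hand-written exact-match scorer; the name and comparison logic are illustrative.

```typescript
// An exact-match scorer: returns 1 if the output equals the expected value.
function exactMatch({ output, expected }: { output: string; expected?: string }) {
  return {
    name: "exact_match",
    score: output === expected ? 1 : 0,
  };
}

// Reference it in the scores array of your Eval, alongside any autoevals scorers.
```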
Score using AI (LLM judges)
You can also define your own prompt-based scoring functions. For example:
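The following sketch uses autoevals' LLMClassifierFromTemplate; the prompt, choices, and scorer name are illustrative.

```typescript
import { LLMClassifierFromTemplate } from "autoevals";

// A prompt-based scorer that penalizes unnecessary apologies in the output.
const noApology = LLMClassifierFromTemplate({
  name: "no_apology",
  promptTemplate:
    "Does the following response contain an unnecessary apology?\n\nResponse: {{output}}",
  choiceScores: { Yes: 0, No: 1 },
  useCoT: true,
});
```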
Conditional scoring
Sometimes, the scoring function(s) you want to use depend on the input data. For example, if you're evaluating a chatbot, you might want to use a scoring function that measures whether calculator-style inputs are correctly answered.
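For instance, here is a sketch where the scorer only applies to calculator-style inputs and is skipped (by returning null) for everything else:

```typescript
// Only score arithmetic-looking inputs; skip the scorer for everything else.
function calculatorAccuracy({
  input,
  output,
  expected,
}: {
  input: string;
  output: string;
  expected?: string;
}) {
  const isCalculatorQuery = /^[\d\s+\-*\/().]+$/.test(input);
  if (!isCalculatorQuery) {
    return null; // null/None scores are ignored when computing the overall score
  }
  return {
    name: "calculator_accuracy",
    score: output.trim() === expected?.trim() ? 1 : 0,
  };
}
```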
Skip scorers
Return null/None to skip a scorer for a particular test case. Scores with null/None values will be ignored when computing the overall score, improvements/regressions, and summary metrics like standard deviation.
Handle scorers on errored test cases
By default, eval tasks or scorers that throw an exception will not generate score values. This means the computed overall score may be higher than it would be if no test cases had errored. If you would like to change this behavior, you can pass an unhandled score handler to your Eval call. We provide a default handler that logs a 0% value for any score that doesn't complete successfully.
List of scorers
You can also return a list of scores from a scorer function. This allows you to dynamically generate scores based on the input data, or even combine scores together into a single score. When you return a list of scores, each entry must be a Score object, which has a name and a score field.
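For instance, a sketch of a scorer that returns several Score objects at once (the score names are illustrative):

```typescript
// Return multiple scores from a single scorer function.
function multiAspect({ output, expected }: { output: string; expected?: string }) {
  return [
    { name: "non_empty", score: output.length > 0 ? 1 : 0 },
    { name: "matches_expected", score: output === expected ? 1 : 0 },
  ];
}
```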
Scorers with additional fields
Certain scorers, like ClosedQA, allow additional fields to be passed in. You can pass them in by initializing them with .partial(...).
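For example, the sketch below pre-fills ClosedQA's criteria; the criteria text is illustrative and the exact field name may vary by version:

```typescript
import { ClosedQA } from "autoevals";

// Bind the extra field up front, then use the result like any other scorer.
const answersQuestion = ClosedQA.partial({
  criteria: "Does the answer directly address the user's question?",
});
```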
Composing scorers
Sometimes, it's useful to build scorers that call other scorers. For example, if you're building a translation app, you could reverse translate the output, and use EmbeddingSimilarity to compare it to the original input.
To compose scorers, call one scorer from another.
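Here is a sketch of that translation example: it reverse-translates the output with an OpenAI call and scores it against the original input using EmbeddingSimilarity (the model and prompt are illustrative):

```typescript
import { wrapOpenAI } from "braintrust";
import { EmbeddingSimilarity } from "autoevals";
import OpenAI from "openai";

const client = wrapOpenAI(new OpenAI());

// Reverse-translate the output, then compare it to the original input.
async function reverseTranslationScore({ input, output }: { input: string; output: string }) {
  const completion = await client.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: `Translate this back to English:\n\n${output}` }],
  });
  const reversed = completion.choices[0].message.content ?? "";

  // Call one scorer from another.
  const similarity = await EmbeddingSimilarity({ output: reversed, expected: input });
  return { name: "reverse_translation_similarity", score: similarity.score ?? 0 };
}
```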
Custom metrics
Sometimes, you need to measure counts or other numbers that cannot be normalized to [0,1]. In Braintrust, these are called metrics, and they can be aggregated just like scores, but have less built-in semantic meaning. Braintrust automatically collects several metrics, like token usage, duration, and error counts, but you can also add your own.
For example, to log a metric corresponding to the number of docs retrieved, you can write:
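The sketch below assumes you are inside a traced task and uses a placeholder retrieval function:

```typescript
import { currentSpan } from "braintrust";

// Placeholder for your real retrieval step.
async function retrieve(query: string): Promise<string[]> {
  return ["doc 1", "doc 2"];
}

async function task(input: string) {
  const docs = await retrieve(input);
  // Custom metrics go under the `metrics` field of the current span.
  currentSpan().log({ metrics: { num_docs_retrieved: docs.length } });
  return docs.join("\n");
}
```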
Aggregation
Metrics can be aggregated within a trace (for example, to report in the experiment table) and across traces (for example, to report their performance at the experiment level). For the most part, metrics are aggregated by sum, for example token counts, but there are some exceptions, like duration, which is the max of metrics.end - metrics.start across spans within a trace.
Any custom metrics you log will be summed.
Additional metadata
While executing the task
Although you can provide metadata about each test case in the data function, it can be helpful to add additional
metadata while your task is executing. The second argument to task is a hooks object, which allows you to read
and update metadata on the test case.
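A sketch of that pattern, assuming hooks exposes a mutable metadata object (the recorded values are illustrative):

```typescript
import { Eval } from "braintrust";
import { Levenshtein } from "autoevals";

Eval("My Project", {
  data: () => [{ input: "Foo", expected: "Hi Foo" }],
  task: async (input, hooks) => {
    // Record run-time details on the test case while the task executes.
    hooks.metadata.model = "gpt-4o-mini";
    return "Hi " + input;
  },
  scores: [Levenshtein],
});
```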
Adding metadata to a scoring function
To make it easier to debug logs that do not produce a good score, you may want to log additional values in addition to the output of a scoring function. To do this, you can add a metadata field to the return value of your function, for example:
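The sketch below attaches the normalized output it compared against; the names are illustrative.

```typescript
// Attach extra context to the score to make low scores easier to debug.
function fuzzyMatch({ output, expected }: { output: string; expected?: string }) {
  const normalized = output.trim().toLowerCase();
  return {
    name: "fuzzy_match",
    score: normalized === expected?.trim().toLowerCase() ? 1 : 0,
    metadata: { normalizedOutput: normalized },
  };
}
```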
Experiment-level metadata
It can be useful to add custom metadata to your experiments, for example, to store information about the model or other parameters that you use. To set custom metadata, pass a metadata field to your Eval block:
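The following sketch uses illustrative metadata values:

```typescript
import { Eval } from "braintrust";
import { Levenshtein } from "autoevals";

Eval("My Project", {
  data: () => [{ input: "Foo", expected: "Hi Foo" }],
  task: async (input) => "Hi " + input,
  scores: [Levenshtein],
  // Experiment-level metadata, e.g. the model and parameters used for this run.
  metadata: { model: "gpt-4o-mini", temperature: 0 },
});
```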
Use custom prompts/functions from Braintrust
In addition to writing code directly in your evals, you can also use custom prompts and functions that you host in Braintrust in your code. Use cases include:
- Running a code-based eval on a prompt that lives in Braintrust.
- Using a hosted scorer in your evals.
- Using a scorer written in a different language than your eval code (e.g. calling a Python scorer from a TypeScript eval).
To use a hosted prompt or function in your eval, load it with the initFunction/init_function function.
Trials
It is often useful to run each input in an evaluation multiple times, to get a sense of the variance in responses and get a more robust overall score. Braintrust supports trials as a first-class concept, allowing you to run each input multiple times. Behind the scenes, Braintrust will intelligently aggregate the results by bucketing test cases with the same input value and computing summary statistics for each bucket.
To enable trials, add a trialCount/trial_count property to your evaluation:
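The TypeScript sketch below uses an illustrative trial count of 3:

```typescript
import { Eval } from "braintrust";
import { Levenshtein } from "autoevals";

Eval("My Project", {
  data: () => [{ input: "Foo", expected: "Hi Foo" }],
  task: async (input) => "Hi " + input,
  scores: [Levenshtein],
  trialCount: 3, // run each input three times and aggregate per-input results
});
```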
Hill climbing
Sometimes you do not have expected outputs, and instead want to use a previous experiment as a baseline. Hill climbing is inspired by, but not exactly the same as, the term used in numerical optimization. In the context of Braintrust, hill climbing is a way to iteratively improve a model's performance by comparing new experiments to previous ones. This is especially useful when you don't have a pre-existing benchmark to evaluate against. Braintrust supports hill climbing as a first-class concept, allowing you to use a previous experiment's output field as the expected field for the current experiment. Autoevals also includes a number of scorers, like Summary and Battle, that are designed to work well with hill climbing.
To enable hill climbing, use BaseExperiment() in the data field of an eval:
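The sketch below assumes BaseExperiment is imported from the braintrust package:

```typescript
import { Eval, BaseExperiment } from "braintrust";
import { Summary } from "autoevals";

Eval("My Project", {
  // Use the most recent ancestor experiment as the baseline.
  data: BaseExperiment(),
  task: async (input) => "Hi " + input,
  scores: [Summary],
});
```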
Braintrust will automatically populate the expected field by merging the expected and output fields of the base experiment. This means that if you set expected, e.g. through the UI while reviewing results, it will be used as the expected field for the next experiment.
Using a specific experiment
If you want to use a specific experiment as the base experiment, you can pass the name field to BaseExperiment():
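The experiment name below is illustrative:

```typescript
import { BaseExperiment } from "braintrust";

// Pin the baseline to a specific prior experiment by name.
const data = BaseExperiment({ name: "main-1234" });
```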
When hill climbing, it helps to use scoring methods that don't depend on a fixed ground truth:
- Methods that do not require an expected output, e.g. ClosedQA, so that you can judge the quality of the output purely based on the input and output. This measure is useful to track across experiments, and it can be used to compare any two experiments, even if they are not sequentially related.
- Comparative methods, e.g. Battle or Summary, that accept an expected output but do not treat it as a ground truth. Generally speaking, if you score > 50% on a comparative method, it means you're doing better than the base on average. To learn more about how Battle and Summary work, check out their prompts.
Custom reporters
When you run an experiment, Braintrust logs the results to your terminal, and braintrust eval returns a non-zero exit code if any eval throws an exception. However, it's often useful to customize this behavior, e.g. in your CI/CD pipeline to precisely define what constitutes a failure, or to report results to a different system.
Braintrust allows you to define custom reporters that can be used to process and log results anywhere you’d like. You can define a reporter by adding a Reporter(...) block. A Reporter has two functions:
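The sketch below assumes the two callbacks are named reportEval (called once per Eval block with its results) and reportRun (called once with all per-eval reports); check the SDK reference for the exact signatures:

```typescript
import { Reporter } from "braintrust";

Reporter("my-reporter", {
  // Called once per Eval block with its results; return a value summarizing it.
  reportEval: async (evaluator, result, opts) => {
    const failed = result.results.filter((r) => r.error !== undefined);
    return failed.length === 0;
  },
  // Called once at the end with every per-eval report; return false to fail the run.
  reportRun: async (evalReports) => evalReports.every((passed) => passed),
});
```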
Any Reporter included among your evaluated files will be automatically picked up by the braintrust eval command.
- If no reporters are defined, the default reporter will be used, which logs the results to the console.
- If you define one reporter, it'll be used for all Eval blocks.
- If you define multiple Reporters, you have to specify the reporter name as an optional 3rd argument to Eval().
Attachments
Braintrust allows you to log arbitrary binary data, like images, audio, and PDFs, as attachments. The easiest way to use attachments in your evals is to initialize an Attachment object in your data.
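A sketch, assuming the Attachment constructor accepts data, filename, and contentType fields (the file path and task are illustrative):

```typescript
import * as fs from "fs";
import { Eval, Attachment } from "braintrust";
import { Levenshtein } from "autoevals";

Eval("My Project", {
  data: () => [
    {
      input: {
        question: "What is in this image?",
        image: new Attachment({
          data: fs.readFileSync("path/to/image.png"),
          filename: "image.png",
          contentType: "image/png",
        }),
      },
      expected: "A cat sitting on a couch",
    },
  ],
  // Placeholder for a real multimodal model call.
  task: async (input) => "A cat sitting on a couch",
  scores: [Levenshtein],
});
```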
Tracing
Braintrust allows you to trace detailed debug information and metrics about your application that you can use to measure performance and debug issues. The trace is a tree of spans, where each span represents an expensive task, e.g. an LLM call, vector database lookup, or API request.
If you are using the OpenAI API, Braintrust includes a wrapper function that automatically logs your requests. To use it, call wrapOpenAI/wrap_openai on your OpenAI instance. See Wrapping OpenAI for more info.
Each call to experiment.log() creates its own trace, starting at the time of the previous log statement and ending at the completion of the current. Do not mix experiment.log() with tracing. It will result in extra traces that are not correctly parented.
To add more detailed spans, wrap the relevant code in the braintrust.traced function. Inside the wrapped function, you can log incrementally to braintrust.currentSpan(). For example, you can progressively log the input, output, and expected output of a task, and then log a score at the end:
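The sketch below logs the input and output incrementally inside a traced block; the span name and task are illustrative:

```typescript
import { Eval, traced, currentSpan } from "braintrust";
import { Levenshtein } from "autoevals";

Eval("My Project", {
  data: () => [{ input: "Foo", expected: "Hi Foo" }],
  task: async (input) =>
    traced(
      async () => {
        // Log fields to the current span as they become available.
        currentSpan().log({ input });
        const output = "Hi " + input; // placeholder for a real LLM call
        currentSpan().log({ output });
        return output;
      },
      { name: "say-hi" },
    ),
  scores: [Levenshtein],
});
```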


Logging SDK
The SDK allows you to report evaluation results directly from your code, without using the Eval() or .traced() functions. This is useful if you want to structure your own complex evaluation logic, or integrate Braintrust with an existing testing or evaluation framework.
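Here is a sketch of direct logging, assuming init returns an experiment object with log and summarize methods (the project name and values are illustrative):

```typescript
import { init } from "braintrust";

const experiment = init("My Project", { experiment: "direct-logging" });

experiment.log({
  input: "Foo",
  output: "Hi Foo",
  expected: "Hi Foo",
  scores: { exact_match: 1 },
  metadata: { model: "gpt-4o-mini" },
});

// Print a summary (including a link to the experiment) when you're done.
console.log(await experiment.summarize());
```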
Troubleshooting
Exception when mixing log with traced
There are two ways to log to Braintrust: Experiment.log and
Experiment.traced. Experiment.log is for non-traced logging, while
Experiment.traced is for tracing. This exception is thrown when you mix both
methods on the same object, for instance:
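The sketch below is illustrative: a top-level Experiment.log call happens while an Experiment.traced span on the same object is still open.

```typescript
import { init } from "braintrust";

const experiment = init("My Project");

await experiment.traced(async (span) => {
  span.log({ input: "Foo", output: "Hi Foo" });
  // Mixing: a non-traced, top-level log on the same experiment object
  // while the span above is still active.
  experiment.log({ input: "Bar", output: "Hi Bar", scores: { exact_match: 0 } });
});
```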
Braintrust expects you to use either Experiment.log or Experiment.traced on a given object, but not both, so the SDK throws an error to prevent accidentally mixing them together. For the above example, you most likely want to write:
If you have a use case that requires mixing both methods on the same object, you can pass allowConcurrentWithSpans: true/allow_concurrent_with_spans=True to Experiment.log.
Local evaluation without sending logs to Braintrust
You can also run evaluations locally without creating experiments or sending data to Braintrust. In TypeScript, use the noSendLogs parameter. In Python, use the no_send_logs parameter. This will:
- Run all tasks and scorers locally
- Generate a local summary of results
- Not create an experiment in Braintrust
- Not send any data to the Braintrust servers
Accessing results from local evaluation
When running locally, you can access the detailed results and summary from the returned object. You can also pass the --no-send-logs flag when using the CLI command braintrust eval.
Online evaluation
Although you can log scores from your application, it can be awkward and computationally intensive to run evals code in your production environment. To solve this, Braintrust supports server-side online evaluations that are automatically run asynchronously as you upload logs. You can pick from the pre-built autoevals functions or your custom scorers, and define a sampling rate along with more granular filters to control which logs get evaluated.
Configuring online evaluation
To create an online evaluation, navigate to the Configuration tab in a project and create an online scoring rule. The score will now automatically run at the specified sampling rate for all logs in the project.
Note that online scoring will only be activated once a span has been fully logged. We detect this by checking for the existence of a metrics.end timestamp on the span, which is written automatically by the SDK when the span is finished.
If you are logging through a different means, such as the REST API or any of our API wrappers, you will have to explicitly include metrics.end as a Unix timestamp (we also suggest metrics.start) in order to activate online scoring.