- Offline evals are structured experiments used to compare and improve your app systematically.
- Online evals run scorers on live requests to monitor performance in real time.

Why are evals important?
In AI development, it's hard for teams to understand how an update will impact performance. This breaks the typical software development loop, making iteration feel like guesswork instead of engineering. Evaluations solve this, helping you distill the non-deterministic outputs of AI applications into an effective feedback loop so you can ship more reliable, higher-quality products. Specifically, great evals help you:

- Understand whether an update is an improvement or a regression
- Quickly drill down into good / bad examples
- Diff specific examples vs. prior runs
- Avoid playing whack-a-mole
Breaking down evals
Evals consist of 3 parts:

- Data: a set of examples to test your application on
- Task: the AI function you want to test (any function that takes in an `input` and returns an `output`)
- Scores: a set of scoring functions that take an `input`, `output`, and optional `expected` value and compute a score
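To make the three pieces concrete, here is a minimal, framework-free sketch that wires them together by hand. The names (`task`, `exact_match`) and the toy arithmetic "model" are illustrative assumptions, not part of any SDK:

```python
# Data: examples pairing an input with an expected value
data = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "3 + 5", "expected": "8"},
]

# Task: the function under test (a toy stand-in for a model call)
def task(input):
    return str(eval(input))  # toy arithmetic "model"; a real task would call your AI app

# Score: compares output against expected, returning a score between 0 and 1
def exact_match(input, output, expected):
    return 1.0 if output == expected else 0.0

# Run every example through the task and score the results
scores = [exact_match(ex["input"], task(ex["input"]), ex["expected"]) for ex in data]
print(sum(scores) / len(scores))  # average score across the dataset
```

An eval framework automates exactly this loop, plus logging, diffing, and the UI described below.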
Run an eval by calling the `Eval()` function with these 3 pieces:
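A sketch of the call using the Braintrust Python SDK is below. The project name, task, and scorer are hypothetical, and the `Eval()` call is guarded behind an API-key check since it sends results to Braintrust:

```python
import os

# Task: the AI function under test (a hypothetical toy example)
def task(input):
    return "Hi " + input

# Score: a custom scorer taking input, output, and expected
def exact_match(input, output, expected):
    return 1.0 if output == expected else 0.0

# Data: examples with inputs and expected outputs
data = [{"input": "Foo", "expected": "Hi Foo"}]

# Only run the eval when Braintrust credentials are configured
if os.environ.get("BRAINTRUST_API_KEY"):
    from braintrust import Eval

    Eval(
        "Say Hi Bot",        # project name (any string)
        data=lambda: data,   # Data
        task=task,           # Task
        scores=[exact_match],  # Scores
    )
```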
View experiments
Running your `Eval()` function will automatically create an experiment in Braintrust, display a summary in your terminal, and populate the UI:

- Preview each test case and score in a table
- Filter by high or low scores
- Select any individual example and see detailed tracing
- See high level scores
- Sort by improvements or regressions