Run locally
Run evaluation code locally to create an experiment in Braintrust and return summary metrics, including a direct link to your experiment. See Interpret results for how to read it.
Install the SDK and dependencies:

Create the eval code:

Run your evaluation with the braintrust eval CLI:

Use --watch to re-run automatically when files change:
Benefits of using the CLI
- Automatic .env loading — reads .env.development.local, .env.local, .env.development, and .env
- Multi-file support — pass multiple files or directories: braintrust eval [file or directory] ...
- TypeScript transpilation — no build step required; the CLI handles it
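For example, the invocations above might look like this (the file and directory names are illustrative):

```shell
# Run a single eval file
npx braintrust eval my-eval.eval.ts

# Run every eval under a directory, plus one extra file
npx braintrust eval evals/ extra.eval.ts

# Re-run automatically when files change
npx braintrust eval --watch evals/
```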
Run in UI
Create from scratch
Create and run experiments directly in the Braintrust UI without writing code:

- Go to Experiments.
- Click + Experiment or use the empty state form.
- Select one or more prompts, workflows, or scorers to evaluate.
- Choose or create a dataset:
  - Select existing dataset: Pick from datasets in your organization
  - Upload CSV/JSON: Import test cases from a file
  - Empty dataset: Create a blank dataset to populate manually later
- Add scorers to measure output quality.
- Click Create to execute the experiment.
UI experiments run without a time limit on cloud and on self-hosted deployments running data plane v2.0 or later.
Promote from a playground
Playground runs are mutable — re-running overwrites previous results. When you’ve iterated to a configuration worth keeping, promote it to an experiment to capture an immutable snapshot:

- Run your playground.
- Select + Experiment.
- Name your experiment.
- Access it from the Experiments page.
Run in CI/CD
Integrate evaluations into your CI/CD pipeline to catch regressions before they reach production.

GitHub Actions

Use the braintrustdata/eval-action to run evaluations on every pull request:
Other CI systems
For other CI systems, run evaluations as a standard shell command with the BRAINTRUST_API_KEY environment variable set.
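A minimal sketch of what that CI step might look like (the eval directory path is hypothetical):

```shell
# Fails the CI step with a non-zero exit code if any eval throws
BRAINTRUST_API_KEY="$BRAINTRUST_API_KEY" npx braintrust eval evals/
```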
Configure experiments
Customize experiment behavior with options:

Run without uploading results
Sometimes you want to run your evaluation locally without creating an experiment in Braintrust — while iterating on a new scorer, wiring up a new eval pipeline, or running in an environment without a Braintrust API key. Your tasks and scorers still run and print a summary to your terminal; results just aren’t uploaded.
Via the CLI:

Or in code:
Run trials
Run each input multiple times to measure variance and get more robust scores. Braintrust intelligently aggregates results by bucketing test cases with the same input value:
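The aggregation idea can be sketched in plain Python. This is an illustration of bucketing by input value, not Braintrust's actual implementation:

```python
from collections import defaultdict
from statistics import mean

# Simulated results: three trials per input, each with a score
results = [
    {"input": "What is 2+2?", "score": 1.0},
    {"input": "What is 2+2?", "score": 1.0},
    {"input": "What is 2+2?", "score": 0.0},
    {"input": "Capital of France?", "score": 1.0},
    {"input": "Capital of France?", "score": 1.0},
    {"input": "Capital of France?", "score": 1.0},
]

# Bucket trials by input value, then average the scores within each bucket
buckets = defaultdict(list)
for r in results:
    buckets[r["input"]].append(r["score"])

per_input = {inp: mean(scores) for inp, scores in buckets.items()}
print(per_input)
# {'What is 2+2?': 0.666..., 'Capital of France?': 1.0}
```

With trials, a flaky case (like the 2+2 example above) scores between 0 and 1 instead of swinging between them from run to run.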
Enable hill climbing
Hill climbing lets you improve iteratively without expected outputs by using a previous experiment’s output as the expected value for the current run. To enable it, use BaseExperiment() in the data field. Autoevals scorers like Battle and Summary are designed specifically for this workflow.
By default, hill climbing populates the expected field by merging the expected and output fields from the base experiment. If you set expected through the UI while reviewing results, it will be used as the expected field for the next experiment.
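That precedence can be sketched as a plain function (illustrative only, not the SDK's code; the function name is hypothetical):

```python
def resolve_expected(base_expected, base_output, ui_expected=None):
    """Pick the expected value for the next hill-climbing run.

    A UI-reviewed expected wins; otherwise fall back to the base
    experiment's expected, then to its output.
    """
    if ui_expected is not None:
        return ui_expected
    if base_expected is not None:
        return base_expected
    return base_output

assert resolve_expected(None, "draft answer") == "draft answer"
assert resolve_expected("gold", "draft answer") == "gold"
assert resolve_expected("gold", "draft", ui_expected="reviewed") == "reviewed"
```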
To use a specific experiment as the base, pass the name field to BaseExperiment():
Two kinds of scoring methods work well with hill climbing:

- Non-comparative methods like ClosedQA that judge output quality based purely on input and output without requiring an expected value. Track these across experiments to compare any two experiments, even if they aren’t sequentially related.
- Comparative methods like Battle or Summary that accept an expected output but don’t treat it as ground truth. If you score > 50% on a comparative method, you’re doing better than the base on average. Learn more about how Battle and Summary work.
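To make the 50% threshold concrete, here is a plain-Python sketch (the score values are made up; 1.0 means the new output won, 0.0 means the base won, 0.5 is a tie):

```python
from statistics import mean

# Hypothetical comparative scores across five test cases
battle_scores = [1.0, 0.5, 1.0, 0.0, 1.0]

avg = mean(battle_scores)
if avg > 0.5:
    print(f"Beating the base experiment on average ({avg:.2f})")
else:
    print(f"Not beating the base experiment ({avg:.2f})")
```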
Create custom reporters
When you run an experiment, Braintrust logs results to your terminal, and braintrust eval returns a non-zero exit code if any eval throws an exception. Customize this behavior for CI/CD pipelines to precisely define what constitutes a failure or to report results to different systems.
Define custom reporters using Reporter(). A reporter has two functions:

Any Reporter included among your evaluated files will be automatically picked up by the braintrust eval command.
- If no reporters are defined, the default reporter logs results to the console.
- If you define one reporter, it’s used for all Eval blocks.
- If you define multiple Reporters, specify the reporter name as an optional third argument to the eval function.
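The two-function shape can be illustrated with a standalone Python sketch. The function names and summary fields here are hypothetical stand-ins, not the SDK's Reporter API:

```python
def report_eval(summary):
    """Decide whether a single eval passed (hypothetical summary shape)."""
    # Fail if any task threw, or the average score fell below a threshold
    return summary["errors"] == 0 and summary["avg_score"] >= 0.8

def report_run(eval_results):
    """Aggregate per-eval pass/fail results into a process exit code."""
    return 0 if all(eval_results) else 1

summaries = [
    {"name": "qa-eval", "errors": 0, "avg_score": 0.91},
    {"name": "summarize-eval", "errors": 0, "avg_score": 0.74},
]
exit_code = report_run([report_eval(s) for s in summaries])
print(exit_code)  # 1: summarize-eval fell below the 0.8 threshold
```

The same split applies in a real reporter: one function judges each eval, the other turns the collection of judgments into a CI-friendly exit code.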
Include attachments
Braintrust allows you to log binary data like images, audio, and PDFs as attachments. Use attachments in evaluations by initializing an Attachment object in your data:
Trace your evals
Add detailed tracing to your evaluation task functions to measure performance and debug issues. Each span in the trace represents an operation like an LLM call, database lookup, or API request.

Use wrapOpenAI/wrap_openai to automatically trace OpenAI API calls. See Trace LLM calls for details.

Use traced() to log incrementally to spans. This example progressively logs input, output, and metrics:
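To make the incremental-logging idea concrete, here is a toy stand-in for a span. This is not Braintrust's traced() API, just an illustration of attaching fields to a span as they become available:

```python
from contextlib import contextmanager

@contextmanager
def toy_span(name):
    """A stand-in for a tracing span whose fields accumulate over time."""
    span = {"name": name, "fields": {}}
    try:
        yield span
    finally:
        # A real tracer would flush the span here
        print(span["name"], span["fields"])

with toy_span("lookup") as span:
    span["fields"]["input"] = {"query": "capital of France"}  # log input early
    answer = "Paris"                                          # stand-in for real work
    span["fields"]["output"] = answer                         # log output once known
    span["fields"]["metrics"] = {"tokens": 3}                 # log metrics last
```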
Troubleshooting
Evaluations running slowly with maxConcurrency?
If your evaluations are slower than expected when using maxConcurrency, you may be on an older SDK version that flushes logs after every single task completion. Upgrade to TypeScript SDK v3.3.0+ for up to an 8x performance improvement. The SDK now uses byte-based backpressure for better flushing performance.

You can tune the flush threshold with the BRAINTRUST_FLUSH_BACKPRESSURE_BYTES environment variable. See Tune performance for all available configuration options.

Task function throws an exception during eval (C# SDK v0.2.2+)
When the task function throws, the C# eval framework catches the exception, records it on the task span and root span (with ActivityStatusCode.Error), and calls ScoreForTaskException on every scorer instead of Score. The eval continues — no cases are skipped.

The task span and root eval span both receive an OTel exception event with exception.type, exception.message, and exception.stacktrace attributes, visible in any OTel-compatible backend connected to Braintrust.

By default, ScoreForTaskException returns a single score of 0.0. Override it on your IScorer to return a custom fallback score, return an empty list to omit scoring for that case, or re-throw to abort the eval.

Scorer throws an exception during eval (C# SDK v0.2.2+)
When a scorer’s Score method throws, the exception is recorded on that scorer’s span (with ActivityStatusCode.Error and an OTel exception event) and ScoreForScorerException is called as a fallback. Other scorers continue running unaffected.

By default, ScoreForScorerException returns a single score of 0.0. Override it to return a custom fallback, return an empty list to omit the score, or re-throw to abort the eval.

Score spans are named score:<scorer_name> (e.g. score:my_scorer), making individual scorer traces distinguishable in Braintrust and any connected OTel backend.

Next steps
- Interpret results from your experiments
- Compare experiments to measure improvements
- Test complex agents to connect custom code to the playground
- Write scorers to measure quality