Pytest is a Python testing framework. Braintrust integrates with pytest so you can run marked tests as experiments and inspect each test case as a traced span.

Setup

Install Braintrust alongside pytest:
pip install braintrust pytest
Set your API key as an environment variable:
export BRAINTRUST_API_KEY=<your-api-key>

Run your first eval

Mark tests with @pytest.mark.braintrust, accept the braintrust_span fixture, and run pytest with --braintrust.
test_my_llm.py
import pytest


@pytest.mark.braintrust(
    project="support-bot",
    input={"query": "What is Braintrust?"},
    expected={"contains": "evaluation"},
    metadata={"suite": "smoke"},
    tags=["regression"],
)
def test_support_answer(braintrust_span):
    # ask_model stands in for your application's own model-calling code
    output = ask_model("What is Braintrust?")
    braintrust_span.log(output=output)

    assert "evaluation" in output.lower()
Run the test:
pytest --braintrust --braintrust-project="support-bot"
Braintrust creates experiments from the marked tests, logs each test's pass or fail result as a score, and prints an experiment summary at the end of the run.

How it works

  • @pytest.mark.braintrust opts a test into Braintrust tracking.
  • braintrust_span gives you a standard Braintrust span for logging input, output, scores, metadata, and errors.
  • --braintrust enables experiment tracking for the session.
  • --braintrust-project and project=... on the marker control how tests are grouped into projects and experiments.
When --braintrust is not provided, braintrust_span becomes a no-op span, so the same tests still run as normal unit tests.

Parametrized tests

Pytest parameters are logged automatically as input unless you override them in the marker.
test_math.py
import pytest


@pytest.mark.braintrust
@pytest.mark.parametrize(
    "query,expected_answer",
    [
        ("2 + 2", "4"),
        ("Capital of France", "Paris"),
    ],
)
def test_qa(braintrust_span, query, expected_answer):
    output = ask_model(query)
    braintrust_span.log(output=output)

    assert expected_answer.lower() in output.lower()
Each parametrized case becomes its own span in Braintrust.
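For larger suites it can be convenient to build the parameter list from data rather than literals. A sketch, assuming your cases live in a JSON array of `{"query": ..., "expected_answer": ...}` objects (`cases_from_json` is a hypothetical helper, not part of the plugin):

```python
import json


def cases_from_json(text):
    """Turn a JSON array of {"query": ..., "expected_answer": ...} objects
    into (query, expected_answer) tuples for pytest.mark.parametrize."""
    return [(case["query"], case["expected_answer"]) for case in json.loads(text)]
```

The result drops straight into the marker, for example `@pytest.mark.parametrize("query,expected_answer", cases_from_json(Path("cases.json").read_text()))`, and each case still gets its own span.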

CLI options

Option                     Description
--braintrust               Enable Braintrust experiment tracking
--braintrust-project       Override the project name for all tracked tests
--braintrust-experiment    Override the experiment name
--braintrust-api-key       Provide the Braintrust API key on the command line
--braintrust-no-summary    Suppress the terminal experiment summary
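The flags compose. For example, a CI job might pin a dated experiment name and silence the terminal summary (the experiment name here is just an illustration):

```shell
# Run tracked tests under a dated experiment name, without the terminal summary
pytest --braintrust \
  --braintrust-project="support-bot" \
  --braintrust-experiment="nightly-$(date +%F)" \
  --braintrust-no-summary
```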

What to log

The braintrust_span fixture supports normal span logging methods. Typical fields to capture are:
  • input for the prompt or test payload
  • output for the model response
  • expected for the target behavior
  • scores for custom metrics beyond pass or fail
  • metadata for model name, environment, or fixture details
def test_with_scores(braintrust_span):
    output = ask_model("Summarize this ticket")
    braintrust_span.log(
        output=output,
        scores={"quality": 0.9},
        metadata={"model": "gpt-5-mini"},
    )
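Custom scores are numeric values you compute yourself. For example, a substring check like the `expected={"contains": ...}` marker above could be scored with a small helper (`contains_score` is a hypothetical name, not a plugin API):

```python
def contains_score(output, expected_substring):
    """Return 1.0 when the expected substring appears in the output
    (case-insensitive), 0.0 otherwise."""
    return 1.0 if expected_substring.lower() in output.lower() else 0.0
```

Pass the result through the span, for example `braintrust_span.log(scores={"contains": contains_score(output, "evaluation")})`.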
