Pytest is a Python testing framework. Braintrust integrates with pytest so you can run marked tests as experiments and inspect each test case as a traced span.

Setup

Install Braintrust alongside pytest:
pip install braintrust pytest
Set your API key as an environment variable:
export BRAINTRUST_API_KEY=<your-api-key>

Run your first eval

Mark tests with @pytest.mark.braintrust, accept the braintrust_span fixture, and run pytest with --braintrust.
test_my_llm.py
import pytest


@pytest.mark.braintrust(
    project="support-bot",
    input={"query": "What is Braintrust?"},
    expected={"contains": "evaluation"},
    metadata={"suite": "smoke"},
    tags=["regression"],
)
def test_support_answer(braintrust_span):
    # ask_model stands in for your application's own model-calling code
    output = ask_model("What is Braintrust?")
    braintrust_span.log(output=output)

    assert "evaluation" in output.lower()
Run the test:
pytest --braintrust --braintrust-project="support-bot"
Braintrust creates experiments from the marked tests, logs each test's pass or fail result as a score, and prints an experiment summary at the end of the run.

How it works

  • @pytest.mark.braintrust opts a test into Braintrust tracking.
  • braintrust_span gives you a standard Braintrust span for logging input, output, scores, metadata, and errors.
  • --braintrust enables experiment tracking for the session.
  • --braintrust-project and project=... on the marker control how tests are grouped into projects and experiments.
When --braintrust is not provided, braintrust_span becomes a no-op span, so the same tests still run as normal unit tests.

Parametrized tests

Pytest parameters are logged automatically as input unless you override them in the marker.
test_math.py
import pytest


@pytest.mark.braintrust
@pytest.mark.parametrize(
    "query,expected_answer",
    [
        ("2 + 2", "4"),
        ("Capital of France", "Paris"),
    ],
)
def test_qa(braintrust_span, query, expected_answer):
    output = ask_model(query)
    braintrust_span.log(output=output)

    assert expected_answer.lower() in output.lower()
Each parametrized case becomes its own span in Braintrust.
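For larger suites it can be convenient to build the parameter list from data rather than literals. A sketch, assuming your cases live in a JSON array of `{"query": ..., "expected_answer": ...}` objects (`cases_from_json` is a hypothetical helper, not part of the plugin):

```python
import json


def cases_from_json(text):
    """Turn a JSON array of {"query": ..., "expected_answer": ...} objects
    into (query, expected_answer) tuples for pytest.mark.parametrize."""
    return [(case["query"], case["expected_answer"]) for case in json.loads(text)]
```

The result drops straight into the marker, for example `@pytest.mark.parametrize("query,expected_answer", cases_from_json(Path("cases.json").read_text()))`, and each case still gets its own span.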

CLI options

Option                     Description
--braintrust               Enable Braintrust experiment tracking
--braintrust-project       Override the project name for all tracked tests
--braintrust-experiment    Override the experiment name
--braintrust-api-key       Provide the Braintrust API key on the command line
--braintrust-no-summary    Suppress the terminal experiment summary
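The flags compose. For example, a CI job might pin a dated experiment name and silence the terminal summary (the experiment name here is just an illustration):

```shell
# Run tracked tests under a dated experiment name, without the terminal summary
pytest --braintrust \
  --braintrust-project="support-bot" \
  --braintrust-experiment="nightly-$(date +%F)" \
  --braintrust-no-summary
```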

What to log

The braintrust_span fixture supports normal span logging methods. Typical fields to capture are:
  • input for the prompt or test payload
  • output for the model response
  • expected for the target behavior
  • scores for custom metrics beyond pass or fail
  • metadata for model name, environment, or fixture details
def test_with_scores(braintrust_span):
    output = ask_model("Summarize this ticket")
    braintrust_span.log(
        output=output,
        scores={"quality": 0.9},
        metadata={"model": "gpt-5-mini"},
    )
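Custom scores are numeric values you compute yourself. For example, a substring check like the `expected={"contains": ...}` marker above could be scored with a small helper (`contains_score` is a hypothetical name, not a plugin API):

```python
def contains_score(output, expected_substring):
    """Return 1.0 when the expected substring appears in the output
    (case-insensitive), 0.0 otherwise."""
    return 1.0 if expected_substring.lower() in output.lower() else 0.0
```

Pass the result through the span, for example `braintrust_span.log(scores={"contains": contains_score(output, "evaluation")})`.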
