
Summary

Issue: The Name column in experiment results always shows "eval" and cannot be overridden through Eval() or EvalCase parameters. Cause: The SDK hardcodes name="eval" when creating the root span for each row in Eval(). Resolution: Use tags to differentiate rows, or apply a monkey-patch workaround for full Name column control.

Resolution steps

Option 1: Use tags (supported)

Add a tags field to each EvalCase to label rows without changing the Name column.

EvalCase(
    input="my input",
    expected="my output",
    tags=["my-custom-label"],
)

Option 2: Monkey-patch start_span (unsupported)

This overrides the hardcoded name="eval" per row. Use with caution — this is not officially supported and may break with SDK updates.
"""
Demonstrates a monkey-patch to set per-row span names in the Braintrust
eval framework, so the "Name" column in the experiment UI shows a
different value for each row instead of the hardcoded "eval".

The key insight: when the framework calls experiment.start_span(),
it passes the row's `input` and `expected` in the kwargs, so we can
derive a name from the data itself.
"""

import braintrust.framework as _fw

_orig_impl = _fw._run_evaluator_internal_impl

_name_fn = None


def set_eval_name_fn(fn):
    """Register a function that receives (input, expected) and returns a span name."""
    global _name_fn
    _name_fn = fn


async def _patched_impl(experiment, evaluator, *args, **kwargs):
    # Wrap this experiment's start_span so the hardcoded name="eval"
    # can be replaced before the root span for each row is created.
    if experiment is not None:
        _orig_start = experiment.start_span

        def _patched_start(*a, **kw):
            # Only rewrite the root "eval" span; leave other spans alone.
            if kw.get("name") == "eval" and _name_fn is not None:
                kw["name"] = "Custom name: " + _name_fn(
                    kw.get("input"),
                    kw.get("expected"),
                )
            return _orig_start(*a, **kw)

        experiment.start_span = _patched_start
    return await _orig_impl(experiment, evaluator, *args, **kwargs)


_fw._run_evaluator_internal_impl = _patched_impl

# ── Eval definition ──────────────────────────────────────────────────

from braintrust import Eval


def data():
    return [
        {"input": "What is 2+2?", "expected": "4"},
        {"input": "What is the capital of France?", "expected": "Paris"},
        {"input": "What color is the sky?", "expected": "Blue"},
    ]


def task(input, hooks):
    answers = {
        "What is 2+2?": "4",
        "What is the capital of France?": "Paris",
        "What color is the sky?": "Blue",
    }
    return answers.get(input, "I don't know")


def exact_match(input, output, expected):
    # Return a numeric score in [0, 1] rather than a bare bool.
    return 1 if output.strip().lower() == expected.strip().lower() else 0


set_eval_name_fn(lambda input, expected: input[:40])

Eval(
    "pedro-project1",
    data=data,
    task=task,
    scores=[exact_match],
    experiment_name="per-row-name-demo",
)
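The renaming logic above can be sanity-checked without the Braintrust SDK by substituting a stub experiment object. StubExperiment and its start_span below are illustrative stand-ins, not SDK classes:

```python
# Minimal stand-in for the patch logic, using a stub experiment so the
# renaming behavior can be verified offline.
class StubExperiment:
    def __init__(self):
        self.names = []

    def start_span(self, **kw):
        # Record the span name the framework would have used.
        self.names.append(kw.get("name"))


name_fn = lambda input, expected: input[:40]

exp = StubExperiment()
orig_start = exp.start_span


def patched_start(**kw):
    # Same check as the real patch: only rewrite the hardcoded "eval" name.
    if kw.get("name") == "eval" and name_fn is not None:
        kw["name"] = "Custom name: " + name_fn(kw.get("input"), kw.get("expected"))
    return orig_start(**kw)


exp.start_span = patched_start

exp.start_span(name="eval", input="What is 2+2?", expected="4")
exp.start_span(name="task", input="ignored", expected="ignored")
```

The first call is renamed because it matches the hardcoded "eval"; the second passes through untouched.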

Notes

  • Native experiment row name customization is not supported as of this writing.
  • The metadata field on EvalCase is another option for per-row identification if tags do not meet your needs.
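As a sketch of the metadata option, using plain-dict rows and an assumed case_id key chosen here for illustration:

```python
# Per-row metadata (mirroring the EvalCase metadata field) shows up
# alongside each row in the experiment UI; the keys are up to you.
rows = [
    {"input": "What is 2+2?", "expected": "4", "metadata": {"case_id": "math-001"}},
    {"input": "What color is the sky?", "expected": "Blue", "metadata": {"case_id": "general-001"}},
]

case_ids = [row["metadata"]["case_id"] for row in rows]
```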