
LLM-as-a-judge scorer prompts support mustache templating. The variables available depend on whether the scorer is scoped to a Span or a Trace.

Span-level variables

Available in any scorer with Scope: Span. Each matching span is scored independently.
Variable        Description
{{input}}       Input passed to the span
{{output}}      Output produced by the span
{{expected}}    Expected output, if provided (optional)
{{metadata}}    Custom metadata attached to the span
Example prompt:
Rate the helpfulness of this response.

Input: {{input}}
Output: {{output}}
{{#expected}}
Expected: {{expected}}
{{/expected}}

Return "A" for very helpful, "B" for somewhat helpful, "C" for not helpful.

Trace-level variables

Available in scorers with Scope: Trace. The scorer runs once per trace and has access to the full conversation thread. The four span-level variables (input, output, expected, metadata) are also available here and are populated from the root span of the trace.
Variable                  Type      Description
{{input}}                 any       Input from the root span
{{output}}                any       Output from the root span
{{expected}}              any       Expected output from the root span (optional)
{{metadata}}              object    Metadata from the root span
{{thread}}                text      Full conversation rendered as human-readable text
{{thread_count}}          number    Total number of messages in the thread
{{first_message}}         object    First message in the thread
{{last_message}}          object    Last message in the thread
{{user_messages}}         array     All user/human messages only
{{assistant_messages}}    array     All assistant messages only
{{human_ai_pairs}}        array     Turn pairs; each item has {human, assistant}
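For a concrete feel of what these variables hold, here is a sketch for a short two-turn conversation. The exact message shape is an assumption for illustration (the examples below access a content field, and a role field is assumed here), so treat it as a sketch rather than the precise format:
thread_messages = [
    {"role": "user", "content": "How do I reset my password?"},
    {"role": "assistant", "content": "Open Settings > Security and click Reset."},
    {"role": "user", "content": "I don't see that option."},
    {"role": "assistant", "content": "It may be hidden on mobile; try the desktop site."},
]

# Derived variables, following the table above:
thread_count = len(thread_messages)                                   # 4
first_message = thread_messages[0]                                    # first user message
last_message = thread_messages[-1]                                    # final assistant reply
user_messages = [m for m in thread_messages if m["role"] == "user"]   # 2 items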

{{thread}}

{{thread}} renders the entire conversation as formatted text, ready to pass directly to a judge model. It’s the simplest way to give the scorer full conversation context.
Example prompt:
Evaluate whether the assistant's responses across this conversation are helpful and on-topic.

Conversation:
{{thread}}

Return "A" if the assistant performed well, "B" if adequate, "C" if poor.

{{human_ai_pairs}}

For Nunjucks prompts, {{human_ai_pairs}} lets you iterate over matched turn pairs:
{% for pair in human_ai_pairs %}
Turn {{ loop.index }}:
  User: {{ pair.human.content }}
  Assistant: {{ pair.assistant.content }}
{% endfor %}

Were the assistant's responses appropriate throughout?
Pairs are matched by index (first user message with first assistant message, etc.). If the counts are unequal, only the matched pairs are included.
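The matching behavior can be pictured as a simple index-based zip. This is an illustrative sketch of the semantics described above, not Braintrust's actual implementation (the role field on messages is an assumption):
thread = [
    {"role": "user", "content": "Hi, I need help with billing."},
    {"role": "assistant", "content": "Sure, what seems to be the problem?"},
    {"role": "user", "content": "I was charged twice this month."},
]

user_messages = [m for m in thread if m["role"] == "user"]            # 2 messages
assistant_messages = [m for m in thread if m["role"] == "assistant"]  # 1 message

# zip() stops at the shorter list, so the unanswered final user message
# does not appear in the pairs.
human_ai_pairs = [
    {"human": u, "assistant": a}
    for u, a in zip(user_messages, assistant_messages)
]
# -> one pair: the first user message matched with the first assistant reply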

{{user_messages}} and {{assistant_messages}}

These filter the thread to a single role. Useful if you only need one side of the conversation:
Rate the clarity of the user's questions in this support conversation.

User messages:
{{#user_messages}}
- {{content}}
{{/user_messages}}

SDK requirements for trace-level scoring

Trace-level scorers require:
  • TypeScript SDK v2.2.1+
  • Python SDK v0.5.6+
  • Ruby SDK v0.2.1+

Setting up multi-turn conversation scoring

If your application creates a new trace per turn (common for chatbots), the easiest way to make {{thread}} work is to route all turns under a single root span using span.export():
Python:
import braintrust

# First turn — create the session root and export it
with braintrust.start_span(name="chat.session") as session_span:
    session_id = session_span.export()
    # persist session_id (e.g. session cookie, Redis, DB)

# Every subsequent turn — attach as a child
with braintrust.start_span(name="chat.turn", parent=session_id) as span:
    span.log(input={"messages": messages}, output=response)
TypeScript:
import { traced } from "braintrust";

// First turn — create and export the session root
let sessionId!: string; // assigned inside the traced callback below
await traced(async (span) => {
  sessionId = await span.export();
}, { name: "chat.session" });

// Every turn — pass the same parent
await traced(async (span) => {
  span.log({ input: { messages }, output: response });
}, { name: "chat.turn", parent: sessionId });
Once all turns share a root trace, a Trace-scoped LLM-as-a-judge scorer with {{thread}} in the prompt will receive the full conversation.
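Putting the pieces together, a per-chat routing sketch might look like the following. The in-memory dictionary and the handle_turn and call_model helpers are hypothetical; in production you would persist the exported id in a session store (cookie, Redis, database), as noted in the comments above:
import braintrust

# Hypothetical in-memory store: chat session id -> exported root span id.
session_roots: dict[str, str] = {}

def handle_turn(chat_id: str, messages: list, call_model):
    # Create the session root the first time this chat is seen, otherwise reuse it.
    if chat_id not in session_roots:
        with braintrust.start_span(name="chat.session") as session_span:
            session_roots[chat_id] = session_span.export()

    # Attach this turn as a child of the shared session root.
    with braintrust.start_span(name="chat.turn", parent=session_roots[chat_id]) as span:
        response = call_model(messages)
        span.log(input={"messages": messages}, output=response)
    return response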