Learn everything you need to know about evals by building and monitoring a customer support chatbot from scratch.
Navigate traces in the Braintrust UI. Understand span types (root, LLM, scorer, function, task, tool) and use chain-of-thought reasoning to debug scores.
A trace is a complete record of a single eval run, from the moment an input is handed to your task function to when the scores are computed. Every row in your experiment table corresponds to one trace. When you select a row, you open that trace.
For the customer support eval, each trace represents one customer message going through the full pipeline: the input comes in, the task function processes it, the LLM generates a response, and the scorers evaluate the result.
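To make that pipeline concrete, here is a minimal sketch of what one trace records, written in plain Python dictionaries. The function and field names are illustrative, not the Braintrust SDK, and the task and scorer steps are stubbed out where a real eval would call an LLM:

```python
# A toy model of one trace: input -> task -> scorer, as plain data.
# Names here are illustrative, not the Braintrust SDK.

def run_pipeline(customer_message: str) -> dict:
    """Process one customer message and record each step as a span."""
    # Hypothetical task step: a real eval would call an LLM here.
    response = f"Thanks for reaching out! Regarding: {customer_message}"

    # Hypothetical scorer step: a real eval would use an LLM judge here.
    score = 1.0 if "Thanks" in response else 0.0

    return {
        "input": customer_message,
        "output": response,
        "spans": [
            {"type": "task", "output": response},
            {"type": "score", "name": "Brand Alignment", "score": score},
        ],
    }

trace = run_pipeline("Where is my order?")
print(trace["spans"][1]["score"])  # → 1.0
```

Every row in the experiment table corresponds to one such record; opening the row lets you inspect each span individually.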
Within each trace, there are spans. A span is a single unit of work inside a trace. Braintrust recognizes several span types:

- **root** — the top-level `eval` span that wraps the whole run
- **task** — your task function
- **LLM** — an individual model call
- **scorer** — a scoring function or LLM judge
- **function** and **tool** — helper code, typically instrumented with the `@traced` decorator or similar instrumentation

Each span records start and end timestamps, so you can see exactly how long each step took.
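To make the timestamp idea concrete, here is a toy reimplementation of what a tracing decorator does under the hood — not Braintrust's actual `@traced`, just a sketch of the mechanism:

```python
import time
import functools

SPANS = []  # collected span records

def traced(fn):
    """Toy tracing decorator: records a start/end timestamp per call."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.time()
        try:
            return fn(*args, **kwargs)
        finally:
            # Record the span even if the wrapped function raises.
            SPANS.append({"name": fn.__name__, "start": start, "end": time.time()})
    return wrapper

@traced
def classify_intent(message: str) -> str:
    # Hypothetical helper from a support pipeline.
    return "order_status" if "order" in message else "other"

classify_intent("Where is my order?")
print(SPANS[0]["name"])  # → classify_intent
```

Because every span carries both timestamps, the UI can compute each step's duration and show where time is spent.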
Open any row in your experiment table to view its trace. The left side shows the span tree, and the right side shows the details of the selected span.
For a customer support eval, the trace tree might look like this:
```
eval (root span)
├─ task
│  └─ Chat Completion (LLM span)
└─ Brand Alignment (scoring span)
```
The root eval span contains the top-level input and output. Inside it, the task span represents your task function. The Chat Completion span captures the actual LLM call with the model name, the full message history, the completion, token counts, latency, and cost. The Brand Alignment span shows the score and the judge's chain-of-thought reasoning.
When you select the LLM span, you can inspect:

- the model that was used
- the full message history sent to the model
- the completion it returned
- token counts, latency, and cost
When you select a scoring span, you can read the scorer's reasoning. This is especially useful for LLM-as-judge scorers where the judge explains why it assigned a particular score.
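As an illustration of why that reasoning matters, here is a toy LLM-as-judge scorer that returns both a score and its rationale. The judge is simulated with a keyword check so the output structure is visible; a real judge would send the response to an LLM:

```python
def brand_alignment_judge(response: str) -> dict:
    """Toy LLM-as-judge: returns a score plus its reasoning.

    A real judge would prompt an LLM; the rationale here is simulated
    (a hypothetical example, not the course's actual scorer).
    """
    friendly = any(w in response.lower() for w in ("thanks", "happy to help"))
    return {
        "score": 1.0 if friendly else 0.0,
        "rationale": (
            "The response uses a friendly tone consistent with the brand voice."
            if friendly
            else "The response lacks the warm, helpful tone the brand requires."
        ),
    }

result = brand_alignment_judge("Your order ships tomorrow.")
print(result["score"])      # → 0.0
print(result["rationale"])  # explains why the score was assigned
```

Reading the rationale alongside the score is what lets you tell a genuinely bad response apart from a judge that misread a good one.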
When a score looks wrong, the trace is where you go to figure out why. Open the scorer span and read the chain-of-thought reasoning: it will usually reveal whether the response genuinely missed the mark, or whether the judge misread the output and the scorer prompt needs tightening.
Traces turn debugging from guesswork into a structured process. Instead of re-running the eval and hoping for a different result, you can pinpoint exactly where the pipeline broke down.
In the next module, you'll learn how to analyze your eval results at scale, using experiment comparison, Loop queries, and the Braintrust MCP server.