Learn everything you need to know about evals by building and monitoring a customer support chatbot from scratch.
Navigate traces in the Braintrust UI. Understand span types (root, LLM, scorer, function, task, tool) and use chain-of-thought reasoning to debug scores.
A trace is a complete record of a single eval run, from the moment an input is handed to your task function to when the scores are computed. Every row in your experiment table corresponds to one trace. When you select a row, you open that trace.
For the customer support eval, each trace represents one customer message going through the full pipeline: the input comes in, the task function processes it, the LLM generates a response, and the scorers evaluate the result.
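To make that pipeline concrete, here is a minimal sketch of what one trace records, written in plain Python dictionaries. The function and field names are illustrative, not the Braintrust SDK, and the task and scorer steps are stubbed out where a real eval would call an LLM:

```python
# A toy model of one trace: input -> task -> scorer, as plain data.
# Names here are illustrative, not the Braintrust SDK.

def run_pipeline(customer_message: str) -> dict:
    """Process one customer message and record each step as a span."""
    # Hypothetical task step: a real eval would call an LLM here.
    response = f"Thanks for reaching out! Regarding: {customer_message}"

    # Hypothetical scorer step: a real eval would use an LLM judge here.
    score = 1.0 if "Thanks" in response else 0.0

    return {
        "input": customer_message,
        "output": response,
        "spans": [
            {"type": "task", "output": response},
            {"type": "score", "name": "Brand Alignment", "score": score},
        ],
    }

trace = run_pipeline("Where is my order?")
print(trace["spans"][1]["score"])  # → 1.0
```

Every row in the experiment table corresponds to one such record; opening the row lets you inspect each span individually.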
Within each trace, there are spans. A span is a single unit of work inside a trace. Braintrust recognizes several span types:

- **root** — the top-level `eval` span that wraps the whole run
- **task** — your task function
- **LLM** — an individual model call
- **scorer** — a scoring function or LLM judge
- **function** and **tool** — helper code, typically instrumented with the `@traced` decorator or similar instrumentation

Each span records start and end timestamps, so you can see exactly how long each step took.
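To make the timestamp idea concrete, here is a toy reimplementation of what a tracing decorator does under the hood — not Braintrust's actual `@traced`, just a sketch of the mechanism:

```python
import time
import functools

SPANS = []  # collected span records

def traced(fn):
    """Toy tracing decorator: records a start/end timestamp per call."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.time()
        try:
            return fn(*args, **kwargs)
        finally:
            # Record the span even if the wrapped function raises.
            SPANS.append({"name": fn.__name__, "start": start, "end": time.time()})
    return wrapper

@traced
def classify_intent(message: str) -> str:
    # Hypothetical helper from a support pipeline.
    return "order_status" if "order" in message else "other"

classify_intent("Where is my order?")
print(SPANS[0]["name"])  # → classify_intent
```

Because every span carries both timestamps, the UI can compute each step's duration and show where time is spent.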
Open any row in your experiment table to view its trace. The left side shows the span tree, and the right side shows the details of the selected span.
For a customer support eval, the trace tree might look like this:
```
eval (root span)
├─ task
│  └─ Chat Completion (LLM span)
└─ Brand Alignment (scoring span)
```
The root eval span contains the top-level input and output. Inside it, the task span represents your task function. The Chat Completion span captures the actual LLM call with the model name, the full message history, the completion, token counts, latency, and cost. The Brand Alignment span shows the score and the judge's chain-of-thought reasoning.
When you select the LLM span, you can inspect:

- the model that was used
- the full message history sent to the model
- the completion it returned
- token counts, latency, and cost
When you select a scoring span, you can read the scorer's reasoning. This is especially useful for LLM-as-judge scorers where the judge explains why it assigned a particular score.
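As an illustration of why that reasoning matters, here is a toy LLM-as-judge scorer that returns both a score and its rationale. The judge is simulated with a keyword check so the output structure is visible; a real judge would send the response to an LLM:

```python
def brand_alignment_judge(response: str) -> dict:
    """Toy LLM-as-judge: returns a score plus its reasoning.

    A real judge would prompt an LLM; the rationale here is simulated
    (a hypothetical example, not the course's actual scorer).
    """
    friendly = any(w in response.lower() for w in ("thanks", "happy to help"))
    return {
        "score": 1.0 if friendly else 0.0,
        "rationale": (
            "The response uses a friendly tone consistent with the brand voice."
            if friendly
            else "The response lacks the warm, helpful tone the brand requires."
        ),
    }

result = brand_alignment_judge("Your order ships tomorrow.")
print(result["score"])      # → 0.0
print(result["rationale"])  # explains why the score was assigned
```

Reading the rationale alongside the score is what lets you tell a genuinely bad response apart from a judge that misread a good one.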
When a score looks wrong, the trace is where you go to figure out why. Open the scorer span and read the chain-of-thought reasoning: it will usually reveal whether the response genuinely missed the mark, or whether the judge misread the output and the scorer prompt needs tightening.
Traces turn debugging from guesswork into a structured process. Instead of re-running the eval and hoping for a different result, you can pinpoint exactly where the pipeline broke down.
In the next module, you'll learn how to analyze your eval results at scale, using experiment comparison, Loop queries, and the Braintrust MCP server.