Learn everything you need to know about evals by building and monitoring a customer support chatbot from scratch.
Evolve the chatbot into a multi-turn CLI app with production logging. Use init_logger, wrap_openai, and @traced to capture every conversation as a single trace.
All the assets for this module are available at braintrustdata/eval-101-course/module-10.
So far, the customer support bot has been single-turn. One input in, one output out. In reality, a customer writes in, the bot responds, and the customer responds back. The conversation continues until the issue is resolved.
In this module, you'll evolve the chatbot into a multi-turn application and instrument it with Braintrust logging so every conversation gets captured.
When you run evals with Eval(), Braintrust handles tracing automatically. For a production application, you use a different set of tools:
- init_logger replaces Eval as the entry point. It creates a persistent connection to Braintrust for logging production traffic.
- wrap_openai wraps the OpenAI client so that every LLM call is automatically captured as a span.
- @traced creates a function span around any function you want to instrument.
- logger.start_span lets you manually create spans, which is useful for grouping multiple turns into a single root span.

Initialize the Braintrust logger at the top of your application:
import braintrust
import openai
from braintrust import traced, wrap_openai
logger = braintrust.init_logger(project="Customer Support Chatbot")
client = wrap_openai(openai.OpenAI())
init_logger creates the connection to Braintrust. wrap_openai wraps the OpenAI client so that every chat.completions.create call is automatically logged as an LLM span with model name, messages, completion, token counts, and latency.
A multi-turn chat app maintains a running list of messages and appends each new turn:
SYSTEM_PROMPT = {
    "role": "system",
    "content": "You are an efficient, no-nonsense customer support agent. Get straight to the point. Be polite but brief.",
}
@traced
def handle_message(messages):
    response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return response.choices[0].message.content
The @traced decorator creates a function span for each call, and wrap_openai creates a nested LLM span for the model call inside it.
The important part is grouping all turns of a conversation into a single trace. Use logger.start_span to create a root span that wraps the entire chat session:
with logger.start_span(name="conversation") as span:
    messages = [SYSTEM_PROMPT]
    while True:
        user_input = input("You: ")
        if user_input.lower() in ["quit", "exit"]:
            break
        messages.append({"role": "user", "content": user_input})
        response = handle_message(messages)
        messages.append({"role": "assistant", "content": response})
        print(f"Agent: {response}")
Without this grouping, each turn would appear as a separate trace. With it, you get a single trace per conversation that shows the full multi-turn interaction, making it much easier to understand how the conversation unfolded.
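To see the running-transcript pattern on its own, here is the same loop factored into a function you can run without any API calls. This is a sketch, not part of the Braintrust SDK: the respond parameter and the canned replies are stand-ins for handle_message, and in the real app the function body would sit inside the logger.start_span block.

```python
def run_conversation(user_turns, respond, system_prompt):
    """Drive a multi-turn chat: append each user turn, ask `respond` for a
    reply given the full history so far, and append that reply too."""
    messages = [system_prompt]
    for turn in user_turns:
        messages.append({"role": "user", "content": turn})
        reply = respond(messages)
        messages.append({"role": "assistant", "content": reply})
    return messages

# Stub model for illustration: reports how much history it was handed.
def fake_respond(messages):
    return f"(reply after seeing {len(messages)} messages)"

transcript = run_conversation(
    ["My order hasn't arrived.", "The order number is #1234."],
    fake_respond,
    {"role": "system", "content": "You are a support agent."},
)

# Each turn adds a user message and an assistant message to the history,
# so every call to the model sees the entire conversation so far.
assert [m["role"] for m in transcript] == [
    "system", "user", "assistant", "user", "assistant",
]
```

Because the whole history is resent on every turn, later spans in the trace show progressively longer inputs, which is exactly what you see when you inspect a multi-turn trace in Braintrust.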
After running a multi-turn conversation, open the trace in Braintrust. The span tree shows:
conversation (root span)
├─ handle_message (function span, turn 1)
│  └─ Chat Completion (LLM span)
├─ handle_message (function span, turn 2)
│  └─ Chat Completion (LLM span)
└─ handle_message (function span, turn 3)
   └─ Chat Completion (LLM span)
Each turn is a separate function span with its own nested LLM span. You can select any span to see the full message history at that point, the model's response, token counts, and latency.
In the next module, you'll learn how to score multi-turn conversations. Individual turns might each look fine, but the conversation as a whole can still fail. Trace-level scoring catches problems that per-turn scoring misses.