Learn everything you need to know about evals by building and monitoring a customer support chatbot from scratch.
Evolve the chatbot into a multi-turn CLI app with production logging. Use init_logger, wrap_openai, and @traced to capture every conversation as a single trace.
All the assets for this module are available at braintrustdata/eval-101-course/module-10.
So far, the customer support bot has been single-turn. One input in, one output out. In reality, a customer writes in, the bot responds, and the customer responds back. The conversation continues until the issue is resolved.
In this module, you'll evolve the chatbot into a multi-turn application and instrument it with Braintrust logging so every conversation gets captured.
When you run evals with Eval(), Braintrust handles tracing automatically. For a production application, you use a different set of tools:
- init_logger replaces Eval as the entry point. It creates a persistent connection to Braintrust for logging production traffic.
- wrap_openai wraps the OpenAI client so that every LLM call is automatically captured as a span.
- @traced creates a function span around any function you want to instrument.
- logger.start_span lets you manually create spans, which is useful for grouping multiple turns into a single root span.

Initialize the Braintrust logger at the top of your application:
import braintrust
import openai
from braintrust import traced, wrap_openai
logger = braintrust.init_logger(project="Customer Support Chatbot")
client = wrap_openai(openai.OpenAI())
init_logger creates the connection to Braintrust. wrap_openai wraps the OpenAI client so that every chat.completions.create call is automatically logged as an LLM span with model name, messages, completion, token counts, and latency.
A multi-turn chat app maintains a running list of messages and appends each new turn:
SYSTEM_PROMPT = {
    "role": "system",
    "content": "You are an efficient, no-nonsense customer support agent. Get straight to the point. Be polite but brief.",
}
@traced
def handle_message(messages):
    response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return response.choices[0].message.content
The @traced decorator creates a function span for each call, and wrap_openai creates a nested LLM span for the model call inside it.
The important part is grouping all turns of a conversation into a single trace. Use logger.start_span to create a root span that wraps the entire chat session:
with logger.start_span(name="conversation") as span:
    messages = [SYSTEM_PROMPT]
    while True:
        user_input = input("You: ")
        if user_input.lower() in ["quit", "exit"]:
            break
        messages.append({"role": "user", "content": user_input})
        response = handle_message(messages)
        messages.append({"role": "assistant", "content": response})
        print(f"Agent: {response}")
Without this grouping, each turn would appear as a separate trace. With it, you get a single trace per conversation that shows the full multi-turn interaction, making it much easier to understand how the conversation unfolded.
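To see the running-transcript pattern on its own, here is the same loop factored into a function you can run without any API calls. This is a sketch, not part of the Braintrust SDK: the respond parameter and the canned replies are stand-ins for handle_message, and in the real app the function body would sit inside the logger.start_span block.

```python
def run_conversation(user_turns, respond, system_prompt):
    """Drive a multi-turn chat: append each user turn, ask `respond` for a
    reply given the full history so far, and append that reply too."""
    messages = [system_prompt]
    for turn in user_turns:
        messages.append({"role": "user", "content": turn})
        reply = respond(messages)
        messages.append({"role": "assistant", "content": reply})
    return messages

# Stub model for illustration: reports how much history it was handed.
def fake_respond(messages):
    return f"(reply after seeing {len(messages)} messages)"

transcript = run_conversation(
    ["My order hasn't arrived.", "The order number is #1234."],
    fake_respond,
    {"role": "system", "content": "You are a support agent."},
)

# Each turn adds a user message and an assistant message to the history,
# so every call to the model sees the entire conversation so far.
assert [m["role"] for m in transcript] == [
    "system", "user", "assistant", "user", "assistant",
]
```

Because the whole history is resent on every turn, later spans in the trace show progressively longer inputs, which is exactly what you see when you inspect a multi-turn trace in Braintrust.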
After running a multi-turn conversation, open the trace in Braintrust. The span tree shows:
conversation (root span)
├─ handle_message (function span, turn 1)
│  └─ Chat Completion (LLM span)
├─ handle_message (function span, turn 2)
│  └─ Chat Completion (LLM span)
└─ handle_message (function span, turn 3)
   └─ Chat Completion (LLM span)
Each turn is a separate function span with its own nested LLM span. You can select any span to see the full message history at that point, the model's response, token counts, and latency.
In the next module, you'll learn how to score multi-turn conversations. Individual turns might each look fine, but the conversation as a whole can still fail. Trace-level scoring catches problems that per-turn scoring misses.