Learn everything you need to know about evals by building and monitoring a customer support chatbot from scratch.
Use topics to automatically cluster production logs into named categories. Find patterns across hundreds of conversations and build targeted eval datasets.
All the assets for this module are available at braintrustdata/eval-101-course/module-13.
When you have thousands of conversations logged every day, manually reviewing them does not scale. You need a way to automatically understand what your system is being asked to do and where it struggles. Topics solve this by clustering your logs into named categories.
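Under the hood, this kind of topic clustering generally amounts to embedding each conversation, grouping similar ones, and naming each group. A minimal, self-contained sketch of the idea, using toy bag-of-words embeddings and nearest-seed assignment (illustrative only, not Braintrust's actual implementation):

```python
import math
from collections import Counter

def embed(text, vocab):
    """Toy bag-of-words embedding over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def cluster_and_name(logs, seeds):
    """Assign each log to its most similar seed conversation, then name
    each cluster by the most frequent word unique to that cluster."""
    vocab = sorted({w for log in logs for w in log.lower().split()})
    seed_vecs = [embed(s, vocab) for s in seeds]
    clusters = {i: [] for i in range(len(seeds))}
    for log in logs:
        v = embed(log, vocab)
        best = max(range(len(seeds)), key=lambda i: cosine(v, seed_vecs[i]))
        clusters[best].append(log)
    names = {}
    for i, members in clusters.items():
        other_words = {w for j, ms in clusters.items() if j != i
                       for m in ms for w in m.lower().split()}
        counts = Counter(w for m in members for w in m.lower().split()
                         if w not in other_words)
        names[i] = counts.most_common(1)[0][0] if counts else f"topic-{i}"
    return clusters, names

logs = [
    "i want a refund for my order",
    "refund still not processed",
    "cannot log in to my account",
    "password reset link broken on my account",
]
clusters, names = cluster_and_name(logs, seeds=[logs[0], logs[2]])
```

Real systems use learned embeddings and proper clustering algorithms, but the shape is the same: similar conversations land in the same cluster, and each cluster gets a human-readable name.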
Topics requires at least 200 production logs to generate meaningful clusters. If you have fewer than that, generate additional logs by running your multi-turn chat application against a variety of inputs. The more diverse your logs, the better the clusters will be.
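If you are short of the 200-log threshold, a small driver script can produce varied conversations cheaply. Here `run_chat` and `log_conversation` are hypothetical stand-ins for your chat application and your logging call:

```python
import itertools

# Hypothetical stand-ins for your actual chat app and logging call.
def run_chat(message):
    return f"(assistant reply to: {message})"

def log_conversation(user_msg, reply, logs):
    logs.append({"input": user_msg, "output": reply})

TOPICS = ["refund", "shipping delay", "password reset", "billing error"]
PHRASINGS = [
    "I need help with a {t}.",
    "Why is my {t} taking so long?",
    "Can you explain this {t}?",
]

logs = []
# Cross topics with phrasings to get diverse inputs with little effort.
for t, p in itertools.product(TOPICS, PHRASINGS):
    msg = p.format(t=t)
    log_conversation(msg, run_chat(msg), logs)

print(len(logs))  # 4 topics x 3 phrasings = 12 logged conversations
```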
Topic generation follows a four-step pipeline:
Navigate to Topics in the left sidebar and select Generate topics. You can create multiple topic maps that categorize your logs along different dimensions.
For a customer support chatbot, you might create topic maps like:
By task type: answering product questions, processing refunds, troubleshooting errors
By sentiment: frustrated, neutral, satisfied
By issue category: account access, billing, shipping
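One way to think about multiple topic maps: the same conversation gets one label per dimension. A toy keyword-based sketch (the rules and labels here are illustrative, not how Braintrust assigns topics):

```python
# Each topic map is a dimension: an ordered list of (label, keywords).
TOPIC_MAPS = {
    "task_type": [("refund request", {"refund"}),
                  ("troubleshooting", {"error", "broken"})],
    # An empty keyword set acts as a catch-all, so rule order matters.
    "sentiment": [("frustrated", {"angry", "ridiculous"}),
                  ("neutral", set())],
    "issue_category": [("billing", {"charge", "invoice", "refund"}),
                       ("account access", {"login", "password"})],
}

def label(conversation, rules):
    words = {w.strip(".,!?") for w in conversation.lower().split()}
    for name, keywords in rules:
        if not keywords or words & keywords:
            return name
    return "other"

def tag(conversation):
    # One label per topic map, so each log is categorized along every dimension.
    return {dim: label(conversation, rules) for dim, rules in TOPIC_MAPS.items()}

tags = tag("This is ridiculous, I was promised a refund for the double charge")
```

A single conversation can thus surface in three different maps at once: as a refund request, as a frustrated exchange, and as a billing issue.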
The scatter plot view shows your topic clusters in 2D embedding space. Each point is a conversation, colored by its topic. This makes it easy to see how conversations relate to each other and to spot outliers that do not fit neatly into any cluster.
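The outlier-spotting you do visually can also be approximated numerically: a point far from its cluster's centroid is a candidate outlier. A sketch with pre-computed 2D coordinates (in the real view, these come from projecting high-dimensional embeddings down to 2D):

```python
import math

def centroid(points):
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def outliers(points, factor=2.0):
    """Flag points whose distance from the centroid exceeds
    `factor` times the mean distance."""
    c = centroid(points)
    dists = [math.dist(p, c) for p in points]
    mean = sum(dists) / len(dists)
    return [p for p, d in zip(points, dists) if d > factor * mean]

# One cluster's 2D coordinates, with a single far-away point.
cluster = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.0), (8.0, 8.0)]
print(outliers(cluster))  # [(8.0, 8.0)]
```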
Once you are satisfied with the generated topics, select Save topics. You can also enable two automation toggles so that every new conversation is tagged with a topic as soon as it is logged, without any manual effort.
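Conceptually, auto-tagging on ingest just runs the saved topic assignment inside the logging path. A sketch where `SAVED_TOPICS` is a hypothetical stand-in for the saved topic definitions, using keyword overlap in place of real similarity search:

```python
# Saved topics: name -> representative keywords (a stand-in for saved centroids).
SAVED_TOPICS = {
    "billing": {"refund", "charge", "invoice"},
    "account access": {"login", "password", "locked"},
}

def nearest_topic(text):
    words = set(text.lower().split())
    # Pick the topic with the most keyword overlap; "other" if none match.
    best = max(SAVED_TOPICS, key=lambda t: len(words & SAVED_TOPICS[t]))
    return best if words & SAVED_TOPICS[best] else "other"

logged = []

def log_conversation(user_msg, reply):
    # Tagging happens at write time, so no manual review pass is needed.
    logged.append({"input": user_msg, "output": reply,
                   "topic": nearest_topic(user_msg)})

log_conversation("my password reset is not working", "(reply)")
print(logged[0]["topic"])  # account access
```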
Topics become powerful when you combine them with the scores from online scoring. Filter your logs by a specific topic (for example, "Account access and billing issues"), then further filter by low scores to find the conversations where your system struggles most.
Select the most representative failing cases and create a dataset directly from the filtered logs. This gives you a focused eval dataset built from real production data, covering a specific category of interactions that you can test and improve independently.
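As a sketch of that filter-then-export step, assuming each log record carries illustrative `topic` and `score` fields (in Braintrust you would filter in the UI rather than by hand):

```python
logs = [
    {"input": "why was I charged twice", "output": "(reply)", "topic": "billing", "score": 0.2},
    {"input": "refund not received", "output": "(reply)", "topic": "billing", "score": 0.9},
    {"input": "app crashes on launch", "output": "(reply)", "topic": "bugs", "score": 0.3},
    {"input": "invoice shows wrong amount", "output": "(reply)", "topic": "billing", "score": 0.4},
]

def build_dataset(logs, topic, max_score):
    """Keep only low-scoring logs in one topic and shape them as eval cases."""
    return [
        {"input": log["input"], "expected": None}  # expected filled in during review
        for log in logs
        if log["topic"] == topic and log["score"] < max_score
    ]

dataset = build_dataset(logs, topic="billing", max_score=0.5)
print(len(dataset))  # 2 billing conversations scored below 0.5
```

The resulting cases cover one topic only, which is what makes the dataset targeted: you can iterate on billing failures without re-running evals for every category at once.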
In the next module, you'll close the loop. You'll take the production findings from topics and online scoring, turn them into test cases, run a baseline eval, test a fix, and verify the results.