Learn everything you need to know about evals by building and monitoring a customer support chatbot from scratch.
Use topics to automatically cluster production logs into named categories. Find patterns across hundreds of conversations and build targeted eval datasets.
All the assets for this module are available at braintrustdata/eval-101-course/module-13.
When you have thousands of conversations logged every day, manually reviewing them does not scale. You need a way to automatically understand what your system is being asked to do and where it struggles. Topics solve this by clustering your logs into named categories.
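Under the hood, this kind of topic clustering generally amounts to embedding each conversation, grouping similar ones, and naming each group. A minimal, self-contained sketch of the idea, using toy bag-of-words embeddings and nearest-seed assignment (illustrative only, not Braintrust's actual implementation):

```python
import math
from collections import Counter

def embed(text, vocab):
    """Toy bag-of-words embedding over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def cluster_and_name(logs, seeds):
    """Assign each log to its most similar seed conversation, then name
    each cluster by the most frequent word unique to that cluster."""
    vocab = sorted({w for log in logs for w in log.lower().split()})
    seed_vecs = [embed(s, vocab) for s in seeds]
    clusters = {i: [] for i in range(len(seeds))}
    for log in logs:
        v = embed(log, vocab)
        best = max(range(len(seeds)), key=lambda i: cosine(v, seed_vecs[i]))
        clusters[best].append(log)
    names = {}
    for i, members in clusters.items():
        other_words = {w for j, ms in clusters.items() if j != i
                       for m in ms for w in m.lower().split()}
        counts = Counter(w for m in members for w in m.lower().split()
                         if w not in other_words)
        names[i] = counts.most_common(1)[0][0] if counts else f"topic-{i}"
    return clusters, names

logs = [
    "i want a refund for my order",
    "refund still not processed",
    "cannot log in to my account",
    "password reset link broken on my account",
]
clusters, names = cluster_and_name(logs, seeds=[logs[0], logs[2]])
```

Real systems use learned embeddings and proper clustering algorithms, but the shape is the same: similar conversations land in the same cluster, and each cluster gets a human-readable name.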
Topics requires at least 200 production logs to generate meaningful clusters. If you have fewer than that, generate additional logs by running your multi-turn chat application against a variety of inputs. The more diverse your logs, the better the clusters will be.
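If you are short of the 200-log threshold, a small driver script can produce varied conversations cheaply. Here `run_chat` and `log_conversation` are hypothetical stand-ins for your chat application and your logging call:

```python
import itertools

# Hypothetical stand-ins for your actual chat app and logging call.
def run_chat(message):
    return f"(assistant reply to: {message})"

def log_conversation(user_msg, reply, logs):
    logs.append({"input": user_msg, "output": reply})

TOPICS = ["refund", "shipping delay", "password reset", "billing error"]
PHRASINGS = [
    "I need help with a {t}.",
    "Why is my {t} taking so long?",
    "Can you explain this {t}?",
]

logs = []
# Cross topics with phrasings to get diverse inputs with little effort.
for t, p in itertools.product(TOPICS, PHRASINGS):
    msg = p.format(t=t)
    log_conversation(msg, run_chat(msg), logs)

print(len(logs))  # 4 topics x 3 phrasings = 12 logged conversations
```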
Topic generation follows a four-step pipeline:
Navigate to Topics in the left sidebar and select Generate topics. You can create multiple topic maps that categorize your logs along different dimensions.
For a customer support chatbot, you might create topic maps like:
By task type: answering product questions, processing refunds, troubleshooting errors
By sentiment: frustrated, neutral, satisfied
By issue category: account access, billing, shipping
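One way to think about multiple topic maps: the same conversation gets one label per dimension. A toy keyword-based sketch (the rules and labels here are illustrative, not how Braintrust assigns topics):

```python
# Each topic map is a dimension: an ordered list of (label, keywords).
TOPIC_MAPS = {
    "task_type": [("refund request", {"refund"}),
                  ("troubleshooting", {"error", "broken"})],
    # An empty keyword set acts as a catch-all, so rule order matters.
    "sentiment": [("frustrated", {"angry", "ridiculous"}),
                  ("neutral", set())],
    "issue_category": [("billing", {"charge", "invoice", "refund"}),
                       ("account access", {"login", "password"})],
}

def label(conversation, rules):
    words = {w.strip(".,!?") for w in conversation.lower().split()}
    for name, keywords in rules:
        if not keywords or words & keywords:
            return name
    return "other"

def tag(conversation):
    # One label per topic map, so each log is categorized along every dimension.
    return {dim: label(conversation, rules) for dim, rules in TOPIC_MAPS.items()}

tags = tag("This is ridiculous, I was promised a refund for the double charge")
```

A single conversation can thus surface in three different maps at once: as a refund request, as a frustrated exchange, and as a billing issue.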
The scatter plot view shows your topic clusters in 2D embedding space. Each point is a conversation, colored by its topic. This makes it easy to see how conversations relate to each other and to spot outliers that do not fit neatly into any cluster.
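The outlier-spotting you do visually can also be approximated numerically: a point far from its cluster's centroid is a candidate outlier. A sketch with pre-computed 2D coordinates (in the real view, these come from projecting high-dimensional embeddings down to 2D):

```python
import math

def centroid(points):
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def outliers(points, factor=2.0):
    """Flag points whose distance from the centroid exceeds
    `factor` times the mean distance."""
    c = centroid(points)
    dists = [math.dist(p, c) for p in points]
    mean = sum(dists) / len(dists)
    return [p for p, d in zip(points, dists) if d > factor * mean]

# One cluster's 2D coordinates, with a single far-away point.
cluster = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.0), (8.0, 8.0)]
print(outliers(cluster))  # [(8.0, 8.0)]
```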
Once you are satisfied with the generated topics, select Save topics. You can also enable two automation toggles so that every new conversation is tagged with a topic as soon as it is logged, without any manual effort.
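Conceptually, auto-tagging on ingest just runs the saved topic assignment inside the logging path. A sketch where `SAVED_TOPICS` is a hypothetical stand-in for the saved topic definitions, using keyword overlap in place of real similarity search:

```python
# Saved topics: name -> representative keywords (a stand-in for saved centroids).
SAVED_TOPICS = {
    "billing": {"refund", "charge", "invoice"},
    "account access": {"login", "password", "locked"},
}

def nearest_topic(text):
    words = set(text.lower().split())
    # Pick the topic with the most keyword overlap; "other" if none match.
    best = max(SAVED_TOPICS, key=lambda t: len(words & SAVED_TOPICS[t]))
    return best if words & SAVED_TOPICS[best] else "other"

logged = []

def log_conversation(user_msg, reply):
    # Tagging happens at write time, so no manual review pass is needed.
    logged.append({"input": user_msg, "output": reply,
                   "topic": nearest_topic(user_msg)})

log_conversation("my password reset is not working", "(reply)")
print(logged[0]["topic"])  # account access
```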
Topics become powerful when you combine them with the scores from online scoring. Filter your logs by a specific topic (for example, "Account access and billing issues"), then further filter by low scores to find the conversations where your system struggles most.
Select the most representative failing cases and create a dataset directly from the filtered logs. This gives you a focused eval dataset built from real production data, covering a specific category of interactions that you can test and improve independently.
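As a sketch of that filter-then-export step, assuming each log record carries illustrative `topic` and `score` fields (in Braintrust you would filter in the UI rather than by hand):

```python
logs = [
    {"input": "why was I charged twice", "output": "(reply)", "topic": "billing", "score": 0.2},
    {"input": "refund not received", "output": "(reply)", "topic": "billing", "score": 0.9},
    {"input": "app crashes on launch", "output": "(reply)", "topic": "bugs", "score": 0.3},
    {"input": "invoice shows wrong amount", "output": "(reply)", "topic": "billing", "score": 0.4},
]

def build_dataset(logs, topic, max_score):
    """Keep only low-scoring logs in one topic and shape them as eval cases."""
    return [
        {"input": log["input"], "expected": None}  # expected filled in during review
        for log in logs
        if log["topic"] == topic and log["score"] < max_score
    ]

dataset = build_dataset(logs, topic="billing", max_score=0.5)
print(len(dataset))  # 2 billing conversations scored below 0.5
```

The resulting cases cover one topic only, which is what makes the dataset targeted: you can iterate on billing failures without re-running evals for every category at once.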
In the next module, you'll close the loop. You'll take the production findings from topics and online scoring, turn them into test cases, run a baseline eval, test a fix, and verify the results.