Learn everything you need to know about evals by building and monitoring a customer support chatbot from scratch.
Open the comparison view to analyze two experiments side by side. Read chain-of-thought reasoning, spot score differences, and compare token costs.
In the last module, you ran two experiments (the empathetic personality and the efficient personality) and saved them to Braintrust. Now you can analyze the results side by side and decide which personality to ship based on actual numbers.
Navigate to your project's Experiments tab. Select the two experiments you want to compare. Braintrust will show them side by side, with scores, outputs, and metadata for each test case.
Toggle the diff mode to highlight differences between the two experiments. For each row in your dataset, you can compare the two outputs, their scores, and the judge's chain-of-thought reasoning.
The chain-of-thought reasoning is especially useful. It tells you exactly why the judge gave a particular score. For example, the judge might note that the empathetic persona acknowledged the customer's frustration before addressing the issue, while the efficient persona jumped straight to the resolution.
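To make the idea of a per-row diff concrete, here is a minimal sketch in plain Python. It does not call Braintrust. The row ids, the scores, and the helper `diff_scores` are all hypothetical, and the real comparison view may compute and display differences differently.

```python
from dataclasses import dataclass


@dataclass
class RowDiff:
    row_id: str
    base_score: float
    candidate_score: float

    @property
    def delta(self) -> float:
        # Positive means the candidate experiment scored higher on this row.
        return self.candidate_score - self.base_score


def diff_scores(base: dict[str, float], candidate: dict[str, float]) -> list[RowDiff]:
    """Pair scores by row id and sort by largest absolute difference first."""
    shared = base.keys() & candidate.keys()
    diffs = [RowDiff(row_id, base[row_id], candidate[row_id]) for row_id in shared]
    return sorted(diffs, key=lambda d: abs(d.delta), reverse=True)


# Hypothetical brand-alignment scores (0 to 1) for three dataset rows.
empathetic = {"refund-request": 1.0, "late-delivery": 0.75, "password-reset": 0.5}
efficient = {"refund-request": 0.5, "late-delivery": 0.75, "password-reset": 0.75}

diffs = diff_scores(empathetic, efficient)
for d in diffs:
    print(f"{d.row_id}: {d.base_score:.2f} -> {d.candidate_score:.2f} ({d.delta:+.2f})")
```

Sorting by the size of the difference puts the rows most worth reading first, which is where the judge's reasoning is likely to be most informative.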
Beyond quality scores, the comparison view also shows token usage and cost per experiment. You might find that the empathetic persona uses significantly more tokens per response because it includes acknowledgment phrases and longer explanations. The efficient persona, being brief by design, costs less per response.
This is a real tradeoff. Higher brand alignment might come at a higher token cost. The comparison view gives you the data to quantify that tradeoff.
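As a rough illustration of quantifying that tradeoff, the sketch below turns average token counts and scores into an extra cost per response. Every number here is made up for the example: the token counts, the scores, the per-1,000-token prices, and the monthly volume are assumptions, not figures from Braintrust or any model provider.

```python
def cost_per_response(
    prompt_tokens: float,
    completion_tokens: float,
    price_in_per_1k: float,
    price_out_per_1k: float,
) -> float:
    """Cost of one response in dollars, given per-1,000-token prices."""
    return (
        prompt_tokens / 1000 * price_in_per_1k
        + completion_tokens / 1000 * price_out_per_1k
    )


# Hypothetical pricing in dollars per 1,000 tokens.
PRICE_IN = 0.001
PRICE_OUT = 0.002

# Hypothetical per-experiment averages read off the comparison view.
empathetic = {"prompt": 400, "completion": 300, "score": 0.85}
efficient = {"prompt": 400, "completion": 120, "score": 0.70}

empathetic_cost = cost_per_response(
    empathetic["prompt"], empathetic["completion"], PRICE_IN, PRICE_OUT
)
efficient_cost = cost_per_response(
    efficient["prompt"], efficient["completion"], PRICE_IN, PRICE_OUT
)

extra_cost = empathetic_cost - efficient_cost
score_gain = empathetic["score"] - efficient["score"]

# Assumed volume, used only to put the per-response difference in context.
MONTHLY_RESPONSES = 100_000
extra_monthly_cost = extra_cost * MONTHLY_RESPONSES

print(f"Extra cost per response: ${extra_cost:.5f}")
print(f"Score gain: {score_gain:+.2f}")
print(f"Extra cost per month: ${extra_monthly_cost:.2f}")
```

Putting both differences on the same page makes the question explicit: is the score gain worth the added monthly spend at your expected volume?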
With the comparison data in front of you, there are typically three paths forward:
The point is that you now have data, not just intuition, to drive the decision.
In the next lesson, you'll learn the difference between playgrounds and experiments, and when to use each one.