Comparing experiments

Open the comparison view to analyze two experiments side by side. Read chain-of-thought reasoning, spot score differences, and compare token costs.

Analyzing your results

In the last module, you ran two experiments, one with the empathetic persona and one with the efficient persona, and saved them to Braintrust. Now you can compare the results side by side and decide which persona to ship based on actual numbers.

Opening the comparison view

Navigate to your project's Experiments tab. Select the two experiments you want to compare. Braintrust will show them side by side, with scores, outputs, and metadata for each test case.

Reading the diff

Toggle the diff mode to highlight differences between the two experiments. For each row in your dataset, you can see:

The input (the customer message)
Each persona's output (the generated response)
The Brand Alignment score for each persona
The chain-of-thought reasoning from the LLM judge

The chain-of-thought reasoning is especially useful. It tells you exactly why the judge gave a particular score. For example, the judge might note that the empathetic persona acknowledged the customer's frustration before addressing the issue, while the efficient persona jumped straight to the resolution.
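
If you prefer to triage rows outside the UI, the same idea can be sketched in a few lines of Python. This is a minimal illustration, not Braintrust's API: the RowComparison shape, the largest_score_gaps helper, the 0.2 threshold, and the sample rows are all assumptions made up for this example. It surfaces the rows where the two personas' scores differ most, which are usually the rows whose judge reasoning is worth reading first.

```python
from dataclasses import dataclass


@dataclass
class RowComparison:
    """One dataset row in a side-by-side comparison (hypothetical shape)."""
    customer_message: str
    score_a: float  # Brand Alignment score for the first persona, 0.0 to 1.0
    score_b: float  # Brand Alignment score for the second persona, 0.0 to 1.0


def largest_score_gaps(rows, min_gap=0.2):
    """Return rows whose scores differ by at least min_gap, biggest gap first."""
    flagged = [r for r in rows if abs(r.score_a - r.score_b) >= min_gap]
    return sorted(flagged, key=lambda r: abs(r.score_a - r.score_b), reverse=True)


# Made-up rows for illustration only; these are not real experiment results.
rows = [
    RowComparison("I was charged twice this month.", 0.75, 0.5),
    RowComparison("What are your support hours?", 0.75, 0.75),
    RowComparison("My order still hasn't arrived.", 1.0, 0.25),
]

for row in largest_score_gaps(rows):
    # A positive number means the first persona scored higher on this row.
    print(f"{row.score_a - row.score_b:+.2f}  {row.customer_message}")
```

Rows with identical scores drop out, so the list that remains is a short reading queue for the judge's reasoning.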

Comparing costs

Beyond quality scores, the comparison view also shows token usage and cost per experiment. You might find that the empathetic persona uses significantly more tokens per response because it includes acknowledgment phrases and longer explanations. The efficient persona, being brief by design, costs less per response.

This is a real tradeoff. Higher brand alignment might come at a higher token cost. The comparison view gives you the data to quantify that tradeoff.
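
One way to put a number on the tradeoff is to ask how many extra tokens each additional point of Brand Alignment costs. The sketch below assumes you have read two averages per experiment off the comparison view (average score and average tokens per response). The function name, the returned keys, and the sample figures are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def summarize_tradeoff(score_a, tokens_a, score_b, tokens_b):
    """Compare experiment A against experiment B.

    Scores are average Brand Alignment scores on a 0.0 to 1.0 scale.
    Tokens are average tokens per response.
    """
    score_gain = score_a - score_b      # extra quality from choosing A
    extra_tokens = tokens_a - tokens_b  # extra tokens per response for A
    return {
        "score_gain": score_gain,
        "token_ratio": tokens_a / tokens_b,
        # Extra tokens spent per additional point of score.
        # None when A does not score higher than B, since there is no gain to price.
        "extra_tokens_per_score_point": (
            extra_tokens / score_gain if score_gain > 0 else None
        ),
    }


# Made-up averages for illustration only: empathetic (A) versus efficient (B).
summary = summarize_tradeoff(score_a=0.875, tokens_a=240, score_b=0.75, tokens_b=80)
print(summary)
```

With these invented figures, the empathetic persona uses three times the tokens for a 0.125 gain in average score. Whether that is worth paying is a product decision, but at least it is now a concrete one.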

Making a decision

With the comparison data in front of you, there are typically three paths forward:

Ship one persona. If one persona matches or beats the other on quality and also costs less, the decision is straightforward. For example, if the efficient persona scores just as high on brand alignment at a third of the cost, that's a strong signal.
Iterate. Maybe neither persona is quite right. The empathetic persona scores well on tone but is too verbose. The efficient persona is cost-effective but misses on empathy. You could write a third persona that combines the strengths of both and run another experiment.
Segment by use case. Different types of customer messages might call for different approaches. Billing complaints might benefit from empathy, while factual questions might benefit from efficiency. You can use tags in your dataset to analyze performance by category.
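
The "segment by use case" path amounts to grouping scores by tag and comparing personas within each group. Here is a minimal sketch of that grouping, assuming you have exported rows as (tag, persona, score) tuples. The tuple layout, the tag names, and the scores are invented for illustration and do not reflect a specific Braintrust export format.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_tag(rows):
    """Average each persona's score within each tag.

    rows: iterable of (tag, persona, score) tuples.
    Returns {tag: {persona: mean score}}.
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for tag, persona, score in rows:
        buckets[tag][persona].append(score)
    return {
        tag: {persona: mean(scores) for persona, scores in personas.items()}
        for tag, personas in buckets.items()
    }


# Made-up scores for illustration only; these are not real experiment results.
rows = [
    ("billing", "empathetic", 1.0),
    ("billing", "empathetic", 0.75),
    ("billing", "efficient", 0.5),
    ("billing", "efficient", 0.25),
    ("factual", "empathetic", 0.75),
    ("factual", "efficient", 0.75),
    ("factual", "efficient", 1.0),
]

for tag, personas in mean_score_by_tag(rows).items():
    best = max(personas, key=personas.get)  # persona with the highest mean
    print(tag, personas, "-> best:", best)
```

If different tags favor different personas, as they do in this invented data, that is evidence for routing by message type instead of shipping a single persona everywhere.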

The point is that you now have data, not intuition, to drive the decision.

What's next

In the next lesson, you'll learn the difference between playgrounds and experiments, and when to use each one.
