How to deal with nondeterminism

Why the same eval can produce different scores across runs. How temperature affects variance and how trial_count averages results for reliable signal.

All the assets for this module are available at braintrustdata/eval-101-course/module-07.

The same eval, different scores

Now that you've run the same eval in both the UI and in code, there's something important to understand before going further: running an eval twice doesn't always produce the same scores, even with the same dataset, the same task, and the same scorer.

This is expected behavior, not a bug. Understanding why it happens and how to account for it will help you make better decisions based on your eval results.

Why scores vary between runs

LLMs are non-deterministic by nature. Even with the same prompt and the same input, a model can produce slightly different outputs each time. This variation comes from sampling, and the temperature parameter controls how much randomness that sampling introduces into the model's output.

  • At temperature=0, the model always picks the highest-probability token. Outputs are nearly identical across runs (though not always perfectly so).
  • At higher temperature values, the model samples more broadly, introducing more variation.

Most models default to a temperature above zero, which means some randomness is baked in.
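Temperature sampling itself is easy to see with a toy model. The sketch below is pure Python illustration, not anything a model vendor actually runs: it samples from a small set of made-up logits, reducing to greedy argmax at temperature 0 and spreading samples across tokens at higher temperatures.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Sample a token index from logits, scaled by temperature (toy example)."""
    if temperature == 0:
        # Greedy decoding: always pick the highest-probability token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Divide logits by temperature, then apply softmax to get probabilities.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

rng = random.Random(0)
logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens

# temperature=0 is deterministic: the same token on every run.
greedy = {sample_token(logits, 0, rng) for _ in range(100)}

# A higher temperature samples more broadly, so multiple tokens appear.
warm = {sample_token(logits, 1.5, rng) for _ in range(100)}
```

The same intuition carries to full generations: each sampled token can diverge, and divergence compounds across a long response.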

This non-determinism affects your evals in two places:

  1. The task output. Your AI system might generate a slightly different response each time, even for the same input.
  2. The scorer output. If you're using an LLM-as-judge scorer, the judge itself is also non-deterministic. It might score the same response slightly differently on different runs.

Using trial_count to average out noise

Braintrust provides a trial_count parameter that helps you deal with this variance. When you set trial_count to a value greater than 1, Braintrust runs each test case multiple times and averages the scores across those runs.

For example, if you set trial_count=3, each input in your dataset gets evaluated three times. The final score for that input is the average of the three individual scores. This smooths out the noise from non-determinism and gives you a more reliable signal.

```python
from braintrust import Eval

# task and brand_alignment are the task function and LLM-as-judge
# scorer built in the earlier modules of this course.
Eval(
    "Customer Support Chatbot",
    data=lambda: [
        {"input": "How do I reset my password?"},
        {"input": "My order never arrived."},
        {"input": "Can I get a refund?"},
        {"input": "Your app keeps crashing on iOS."},
    ],
    task=task,
    scores=[brand_alignment],
    trial_count=3,
)
```

Higher trial counts produce more stable results but cost more (each trial is an additional LLM call for both the task and the scorer). A trial count of 3 to 5 is a reasonable starting point for most use cases.
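The cost arithmetic is worth making explicit. Assuming one task call plus one LLM-as-judge call per trial (your pipeline may differ, so adjust the count), total calls scale linearly with trial_count. The helper below is a hypothetical illustration, not a Braintrust API:

```python
def total_llm_calls(num_cases, trial_count, calls_per_trial=2):
    # calls_per_trial = 1 task call + 1 LLM-as-judge call per trial.
    # Use 1 if your scorer is deterministic, or more if your task
    # chains several model calls.
    return num_cases * trial_count * calls_per_trial

# The four-case dataset above at trial_count=3:
total_llm_calls(4, 3)  # 24 calls, versus 8 at trial_count=1
```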

When to use trial_count

You don't always need multiple trials. If you're using deterministic scorers (like exact match) and your task output is stable (temperature set to 0), a single trial is fine.

Multiple trials are most valuable when:

  • Your scorer is an LLM-as-judge
  • Your task uses a temperature above 0
  • You're comparing two experiments and the score difference is small
  • You need high confidence in your results before making a shipping decision
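The value of averaging is easy to demonstrate with a toy simulation. Here noisy_score is a hypothetical stand-in for an LLM-as-judge whose judgment jitters around a true score; it is not Braintrust code. Averaging three trials per eval noticeably narrows the spread of final scores (by roughly 1/sqrt(3)):

```python
import random
import statistics

def noisy_score(rng, true_score=0.8, noise=0.1):
    # Hypothetical scorer: the "true" quality plus Gaussian judge noise.
    return true_score + rng.gauss(0, noise)

rng = random.Random(42)

# Spread of single-trial scores vs. 3-trial averages, over 1000 simulated evals.
single = [noisy_score(rng) for _ in range(1000)]
averaged = [statistics.mean(noisy_score(rng) for _ in range(3))
            for _ in range(1000)]

spread_single = statistics.stdev(single)
spread_averaged = statistics.stdev(averaged)
# spread_averaged is smaller: averaged scores cluster tighter around 0.8.
```

This is why small score differences between two experiments are only trustworthy with multiple trials: a single-trial gap can easily be judge noise rather than a real improvement.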

What's next

You now know how to build evals in the UI and in code, compare experiments, and account for non-determinism. In the next module, you'll learn how to read a trace in the Braintrust UI so you can debug individual eval runs and understand exactly what happened at each step.

Further reading

Trace everything