Learn everything you need to know about evals by building and monitoring a customer support chatbot from scratch.
Why the same eval can produce different scores across runs, how temperature affects variance, and how trial_count averages results for a reliable signal.
All the assets for this module are available at braintrustdata/eval-101-course/module-07.
Now that you've run the same eval in the UI and in code, there's something important to understand before going further. Running the same eval doesn't always produce the same scores, even if you're using the same dataset, the same task, and the same scorer.
This is expected behavior, not a bug. Understanding why it happens and how to account for it will help you make better decisions based on your eval results.
LLMs are non-deterministic by nature. Even with the same prompt and the same input, a model can produce slightly different outputs each time. This variation comes from the temperature parameter, which controls randomness in the model's output.
At temperature=0, the model always picks the highest-probability token, so outputs are nearly identical across runs (though not always perfectly so). Most models default to a temperature above zero, which means some randomness is baked in.
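To make the temperature parameter concrete, here is a minimal, self-contained sketch of temperature-scaled sampling (not the code any particular model provider runs, just the standard softmax-sampling idea): at temperature 0 the sampler is greedy and deterministic, while higher temperatures flatten the distribution and let lower-probability tokens through.

```python
import math
import random

def sample_token(logits, temperature):
    """Pick a token index from raw logits.

    temperature=0 -> always the highest-probability token (greedy).
    temperature>0 -> softmax sampling; higher values flatten the
    distribution and increase randomness.
    """
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    # Scale logits by temperature, then softmax into probabilities.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(logits)), weights=probs)[0]

logits = [2.0, 1.0, 0.5]  # token 0 has the highest logit

# Greedy decoding is fully deterministic.
print(sample_token(logits, temperature=0))  # always 0

# At a higher temperature, other tokens get picked some of the time.
samples = [sample_token(logits, temperature=1.5) for _ in range(1000)]
print(len(set(samples)) > 1)  # multiple distinct tokens appear
```

This is why two runs of the same eval can disagree: every sampled token is a small roll of the dice, and those rolls compound over a full response.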
This non-determinism affects your evals in two places: the task itself (the LLM's responses can vary between runs) and any LLM-based scorers (the judge's scores can vary too).
Braintrust provides a trial_count parameter that helps you deal with this variance. When you set trial_count to a value greater than 1, Braintrust runs each test case multiple times and averages the scores across those runs.
For example, if you set trial_count=3, each input in your dataset gets evaluated three times. The final score for that input is the average of the three individual scores. This smooths out the noise from non-determinism and gives you a more reliable signal.
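The averaging described above is simple to see with hypothetical numbers. Suppose one input's brand-alignment score came back differently across three trials:

```python
# Hypothetical per-trial scores for a single input with trial_count=3.
trial_scores = [0.8, 1.0, 0.6]

# The input's final score is the mean of its trials, which smooths
# out run-to-run noise from non-determinism.
final_score = sum(trial_scores) / len(trial_scores)
print(final_score)  # 0.8
```

Any single trial here would have over- or under-stated the score by 0.2; the average lands in the middle and is a more trustworthy signal.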
from braintrust import Eval

Eval(
    "Customer Support Chatbot",
    data=lambda: [
        {"input": "How do I reset my password?"},
        {"input": "My order never arrived."},
        {"input": "Can I get a refund?"},
        {"input": "Your app keeps crashing on iOS."},
    ],
    task=task,
    scores=[brand_alignment],
    trial_count=3,  # run each test case three times and average the scores
)
Higher trial counts produce more stable results but cost more (each trial is an additional LLM call for both the task and the scorer). A trial count of 3 to 5 is a reasonable starting point for most use cases.
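The cost trade-off is easy to estimate with back-of-the-envelope arithmetic. Here is an illustrative helper (not a Braintrust API) that counts LLM calls assuming each trial makes one task call plus one call per LLM-based scorer:

```python
def eval_llm_calls(num_cases, trial_count, llm_scorers=1):
    """Rough count of LLM calls for one eval run.

    Assumes each trial makes one task call plus one call per
    LLM-based scorer; deterministic scorers add no LLM calls.
    Illustrative estimate only, not a Braintrust API.
    """
    return num_cases * trial_count * (1 + llm_scorers)

# The 4-case dataset above with trial_count=3 and one
# LLM-as-judge scorer:
print(eval_llm_calls(4, 3))  # 24
```

Tripling trial_count triples this number, which is why 3 to 5 trials is usually the sweet spot between stability and cost.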
You don't always need multiple trials. If you're using deterministic scorers (like exact match) and your task output is stable (temperature set to 0), a single trial is fine.
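To see why deterministic scorers don't need multiple trials, here is a minimal exact-match scorer sketch (the idea, not Braintrust's built-in implementation): given the same output and expected value, it returns the same score every time, so repeated trials add cost without adding signal.

```python
def exact_match(output, expected):
    """A deterministic scorer: identical inputs always produce an
    identical score, so a single trial is enough."""
    return 1.0 if output == expected else 0.0

print(exact_match("Paris", "Paris"))  # 1.0
print(exact_match("paris", "Paris"))  # 0.0
```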
Multiple trials are most valuable when:
- Your task runs at a temperature above zero
- You're using LLM-based scorers, whose judgments can vary between runs
- Outputs are open-ended, so small wording differences can shift scores
You now know how to build evals in the UI and in code, compare experiments, and account for non-determinism. In the next module, you'll learn how to read a trace in the Braintrust UI so you can debug individual eval runs and understand exactly what happened at each step.