Non-determinism
/nahn.dih'ter.muh.nih.zuhm/ (noun) The property of AI systems where the same input can produce different outputs across identical requests.
Why it matters
Non-determinism is the fundamental reason AI systems need evals rather than just tests. When the same input can produce a different output on every run, a single passing test tells you almost nothing. You need to measure quality across distributions of outputs, track score trends over time, and build datasets large enough to capture the variance. This also changes how you think about regressions. A prompt change might improve average quality while making edge cases worse, and you won't catch that without running your eval suite across enough examples to see the distribution shift.
“Because of non-determinism, we evaluate changes on distributions of scores, not one-off outputs.”
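The idea above can be sketched in a few lines of Python. This is a minimal illustration, not Braintrust's API: `noisy_model` stands in for a non-deterministic model call (e.g., sampling with temperature > 0), and `keyword_score` is a deliberately simple hypothetical scorer. The point is that each variant is scored over many runs and summarized as a distribution, rather than judged from one output.

```python
import random
import statistics

def keyword_score(output: str) -> float:
    """Hypothetical scorer: fraction of required terms present in the output."""
    required = {"refund", "policy", "days"}
    return len(required & set(output.lower().split())) / len(required)

def score_distribution(generate, n: int = 50) -> list[float]:
    """Score n independent generations of the same input."""
    return [keyword_score(generate()) for _ in range(n)]

# Stand-in for a non-deterministic model: the same "input" yields varying outputs.
random.seed(0)  # seeded only to make this demo reproducible
def noisy_model() -> str:
    return random.choice([
        "our refund policy allows returns within 30 days",
        "refunds are possible, contact support",
        "see our refund policy for details",
    ])

scores = score_distribution(noisy_model)
print(f"mean={statistics.mean(scores):.2f}  stdev={statistics.stdev(scores):.2f}")
```

In practice you would compare the score distributions of two prompt or model variants (means, spread, and tail behavior), since a variant can raise the mean while widening the low-scoring tail.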
Customer example
Notion embraced non-determinism by moving ~70 engineers beyond "vibe checks" to systematic evals built from feedback and traces, so the team can ship quickly even as agent behaviors and outcomes vary across runs.
Related Evaluation terms
- Absolute scoring
- Agent
- AI eval
- Alignment
- Annotation schema
- Baseline
- Baseline experiment
- Benchmark
- Calibration
- CI/CD integration
- Coherence
- Confidence interval
- Eval harness
- Eval leakage
- Experiment
- Factuality
- Failure mode
- Faithfulness
- Feedback signal
- Groundedness
- Hallucination
- Inter-annotator agreement (IAA)
- LLM-as-a-judge
- Loop
- Model comparison
- Multimodal
- Offline evaluation
- Pairwise evaluation
- Pass@k
- Playground
- Quality gate
- RAG (retrieval-augmented generation)
- RAG evaluation
- Reference-based scoring
- Reference-free scoring
- Regression testing
- Release criteria
- Remote evaluation
- Rubric
- Safety
- Score distribution
- Scorer
- Semantic failure
- Signal-to-noise ratio
- Task (eval task)
- Toxicity score
From the docs
Braintrust is the AI observability and eval platform for production AI. By connecting evals and observability in one workflow, teams at Notion, Stripe, Zapier, Vercel, and Ramp ship quality AI products at scale.