
Hallucination
/huh.loo.suh'nay.shuhn/ (noun)

When a model produces output that is not supported by the provided context or by reliable external facts, but presents it as if it were true. Hallucinations are a common form of semantic failure.
Why it matters
Hallucinations are uniquely dangerous because they look correct. A model that returns an error is easy to catch, but a model that confidently cites a nonexistent policy or invents a plausible-sounding statistic can pass through human review if the reviewer is not an expert on the topic.

This makes systematic detection essential. You need scorers that check whether outputs are grounded in the provided context (for RAG systems) or consistent with known facts (for open-ended generation). These checks should run both in offline evals and on production traffic, because hallucination rates can shift when input distributions change or when you swap models. Building a dataset of known hallucination-prone inputs and running it as a regression suite gives you an early warning system.

The goal is not to eliminate hallucinations entirely, which remains an open research problem, but to measure their rate, understand their patterns, and catch regressions before they reach your developers or end users.
“The model hallucinated and cited a policy that doesn't exist.”
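As a concrete illustration, here is a minimal sketch of an LLM-as-a-judge groundedness scorer run over a small regression set of hallucination-prone inputs. It assumes the OpenAI Python SDK and an API key in the environment; the judge model, prompt wording, and example data are illustrative stand-ins, not a prescribed implementation.

```python
# Minimal sketch: judge-based groundedness scorer plus a tiny regression
# suite. Assumes the OpenAI Python SDK (`pip install openai`) and an
# OPENAI_API_KEY environment variable. Model name, prompt, and data are
# illustrative assumptions, not a specific vendor's API for this.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an answer for groundedness.
Context:
{context}

Answer:
{answer}

Reply with exactly one word: SUPPORTED if every claim in the answer
is backed by the context, or UNSUPPORTED otherwise."""


def groundedness_score(context: str, answer: str) -> float:
    """Return 1.0 if the judge finds the answer fully supported, else 0.0."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable judge model works here
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(context=context, answer=answer),
        }],
        temperature=0,
    )
    verdict = response.choices[0].message.content.strip().upper()
    return 1.0 if verdict.startswith("SUPPORTED") else 0.0


# Tiny regression set of known hallucination-prone cases (illustrative).
REGRESSION_SET = [
    {
        "context": "Refunds are available within 30 days of purchase.",
        "answer": "Refunds are available within 30 days of purchase.",
    },
    {
        "context": "Refunds are available within 30 days of purchase.",
        "answer": "Our lifetime-refund policy covers all purchases.",  # ungrounded
    },
]

if __name__ == "__main__":
    scores = [groundedness_score(case["context"], case["answer"])
              for case in REGRESSION_SET]
    rate = 1 - sum(scores) / len(scores)  # fraction flagged as ungrounded
    print(f"hallucination rate: {rate:.0%}")
```

In practice you would track this rate across model versions and prompts, run the same scorer on sampled production traffic, and gate releases when the rate regresses.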
Customer example
Graphite's AI code reviewer is evaluated on whether its feedback is actionable and relevant without hallucinations; the team curates datasets from real PR interactions and scores model variants before deployment to reduce ungrounded suggestions.
Related Evaluation terms
- Absolute scoring
- Agent
- AI eval
- Alignment
- Annotation schema
- Baseline
- Baseline experiment
- Benchmark
- Calibration
- CI/CD integration
- Coherence
- Confidence interval
- Eval harness
- Eval leakage
- Experiment
- Factuality
- Failure mode
- Faithfulness
- Feedback signal
- Groundedness
- Inter-annotator agreement (IAA)
- LLM-as-a-judge
- Loop
- Model comparison
- Multimodal
- Non-determinism
- Offline evaluation
- Pairwise evaluation
- Pass@k
- Playground
- Quality gate
- RAG (retrieval-augmented generation)
- RAG evaluation
- Reference-based scoring
- Reference-free scoring
- Regression testing
- Release criteria
- Remote evaluation
- Rubric
- Safety
- Score distribution
- Scorer
- Semantic failure
- Signal-to-noise ratio
- Task (eval task)
- Toxicity score