
Semantic failure

/suh'man.tihk 'fay.lyer/ (noun) When an AI system is operationally healthy (low latency, no errors) but produces outputs that are factually wrong, off-brand, or harmful. It is invisible to infrastructure monitoring and only detectable through evals.

Why it matters

Semantic failures are the defining challenge of AI quality. Your infrastructure monitoring will show green dashboards while your AI gives wrong answers, because a 200 OK response with a hallucinated answer looks identical to a correct one at the HTTP level. The only way to catch it is with evals, whether that means scoring production traces, running regression suites, or having human reviewers flag bad outputs. Teams that rely solely on traditional monitoring will miss the majority of their AI quality issues.

Latency was great, but the assistant kept giving wrong refund rules, so it was actually a semantic failure.
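The refund example above can be sketched as a simple trace-scoring eval. This is an illustrative sketch only, not a real API: the names `check_trace`, `trace`, and `KNOWN_FACTS` are hypothetical, and the "eval" here is a bare reference check standing in for a real scorer.

```python
# Illustrative sketch: scoring a production trace for semantic failure.
# All names (KNOWN_FACTS, check_trace) are hypothetical, not a real library API.

KNOWN_FACTS = {
    "refund_window_days": 30,  # ground-truth policy fact the assistant must state
}

def check_trace(trace: dict) -> dict:
    """Score one trace on operational health and semantic correctness."""
    # Infrastructure monitoring only sees this:
    operational_ok = trace["status_code"] == 200 and trace["latency_ms"] < 2000
    # A minimal reference-based eval sees this:
    semantic_ok = str(KNOWN_FACTS["refund_window_days"]) in trace["output"]
    return {
        "operational_ok": operational_ok,
        "semantic_ok": semantic_ok,
        # Healthy at the HTTP level, wrong at the content level:
        "semantic_failure": operational_ok and not semantic_ok,
    }

# A trace that looks green on every dashboard but states the wrong refund rule.
trace = {
    "status_code": 200,
    "latency_ms": 180,
    "output": "You can request a refund within 90 days of purchase.",
}

print(check_trace(trace))
```

The point of the sketch is the gap between the two signals: `operational_ok` is true, so latency and error-rate dashboards stay green, and only the eval check flips `semantic_failure` to true.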

Customer example

Retool discovered a semantic failure where an agent confidently claimed it completed a task when it hadn't. By analyzing traces with Loop, they found the root cause (missing tool definitions) and fixed the underlying system issue.


Braintrust is the AI observability and eval platform for production AI. By connecting evals and observability in one workflow, teams at Notion, Stripe, Zapier, Vercel, and Ramp ship quality AI products at scale.
