
Semantic failure

/suh'man.tihk 'fay.lyer/ (noun) When an AI system is operationally healthy (low latency, no errors) but produces outputs that are factually wrong, off-brand, or harmful. It is invisible to infrastructure monitoring and only detectable through evals.

Why it matters

Semantic failures are the defining challenge of AI quality. Your infrastructure monitoring will show green dashboards while your AI gives wrong answers, because a 200 OK response with a hallucinated answer looks identical to a correct one at the HTTP level. The only way to catch it is with evals, whether that means scoring production traces, running regression suites, or having human reviewers flag bad outputs. Teams that rely solely on traditional monitoring will miss the majority of their AI quality issues.

Latency was great, but the assistant kept giving wrong refund rules, so it was actually a semantic failure.
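The refund example above can be sketched as a simple trace-scoring eval. This is an illustrative sketch only, not a real API: the names `check_trace`, `trace`, and `KNOWN_FACTS` are hypothetical, and the "eval" here is a bare reference check standing in for a real scorer.

```python
# Illustrative sketch: scoring a production trace for semantic failure.
# All names (KNOWN_FACTS, check_trace) are hypothetical, not a real library API.

KNOWN_FACTS = {
    "refund_window_days": 30,  # ground-truth policy fact the assistant must state
}

def check_trace(trace: dict) -> dict:
    """Score one trace on operational health and semantic correctness."""
    # Infrastructure monitoring only sees this:
    operational_ok = trace["status_code"] == 200 and trace["latency_ms"] < 2000
    # A minimal reference-based eval sees this:
    semantic_ok = str(KNOWN_FACTS["refund_window_days"]) in trace["output"]
    return {
        "operational_ok": operational_ok,
        "semantic_ok": semantic_ok,
        # Healthy at the HTTP level, wrong at the content level:
        "semantic_failure": operational_ok and not semantic_ok,
    }

# A trace that looks green on every dashboard but states the wrong refund rule.
trace = {
    "status_code": 200,
    "latency_ms": 180,
    "output": "You can request a refund within 90 days of purchase.",
}

print(check_trace(trace))
```

The point of the sketch is the gap between the two signals: `operational_ok` is true, so latency and error-rate dashboards stay green, and only the eval check flips `semantic_failure` to true.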

Customer example

Retool discovered a semantic failure where an agent confidently claimed it completed a task when it hadn't. By analyzing traces with Loop, they found the root cause (missing tool definitions) and fixed the underlying system issue.


Braintrust is the AI observability and eval platform for production AI. By connecting evals and observability in one workflow, teams at Notion, Stripe, Zapier, Vercel, and Ramp ship quality AI products at scale.
