
AI observability

/ay.eye uhb.zer.vuh'bih.luh.tee/ (noun)

The infrastructure and practices that let you see what your AI system did, measure whether it was good, and systematically improve it. It connects tracing, evals, and iteration into a single workflow so that production behavior feeds directly into quality improvement.

Why it matters

Traditional application monitoring tracks uptime, error rates, and latency, but those metrics tell you almost nothing about whether your AI system produced a good output. A model can return 200 OK, respond in 300ms, and still hallucinate in a way that damages trust. AI observability closes that gap by connecting three capabilities that are usually siloed. Tracing shows you exactly what happened inside a request. Evals tell you whether the result was good. And iteration tooling lets you go from a bad trace to a better prompt or scorer without context-switching between systems. When these pieces live in one workflow, production problems turn into dataset records, dataset records turn into experiments, and experiments turn into measured improvements. Without that connected loop, teams end up debugging by reading individual logs and guessing at fixes, which breaks down as soon as you have more than a handful of daily active conversations.
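The loop described above can be sketched in a few lines of plain Python. This is an illustrative toy, not a real observability SDK: the names (`Trace`, `build_dataset`, `run_experiment`, the `score` function) are hypothetical stand-ins for the tracing, dataset, and eval pieces a platform would provide.

```python
# Sketch of the connected loop: flagged production traces become dataset
# records, dataset records become experiments, experiments measure a fix.
# All names here are illustrative, not a real SDK.
from dataclasses import dataclass


@dataclass
class Trace:
    """One logged request: input, output, and any user feedback."""
    input: str
    output: str
    flagged: bool = False  # e.g. a thumbs-down from a user


def build_dataset(traces):
    """Turn flagged production traces into eval dataset records."""
    return [{"input": t.input, "bad_output": t.output} for t in traces if t.flagged]


def score(output, record):
    """Toy scorer: 1.0 if the candidate output avoids the known-bad answer."""
    return 0.0 if output == record["bad_output"] else 1.0


def run_experiment(dataset, generate):
    """Re-run each dataset record through a candidate system; average the scores."""
    scores = [score(generate(r["input"]), r) for r in dataset]
    return sum(scores) / len(scores) if scores else 0.0


# Usage: reproduce the failure with the old behavior, then measure a fix.
traces = [
    Trace("What is 2+2?", "5", flagged=True),
    Trace("Capital of France?", "Paris"),
]
dataset = build_dataset(traces)
baseline = run_experiment(dataset, lambda q: "5")   # old system reproduces the bug
candidate = run_experiment(dataset, lambda q: "4")  # hypothetical fixed prompt
print(baseline, candidate)  # 0.0 1.0
```

The point of the sketch is the data flow, not the scorer: because the bad trace became a dataset record, the fix is verified by a measured score delta rather than by rereading individual logs.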

AI observability helped us trace why the assistant failed and verify that our fix improved eval scores.

Customer example

Notion uses Braintrust for AI observability across roughly 70 engineers, deploying frontier models in under 24 hours and finding "needle-in-a-haystack" quality issues by searching traces and turning them into targeted eval datasets.

Braintrust is the AI observability and eval platform for production AI. By connecting evals and observability in one workflow, teams at Notion, Stripe, Zapier, Vercel, and Ramp ship quality AI products at scale.