RAG (retrieval-augmented generation)

/rag rih'tree.vuhl awg'mehn.tuhd jeh.nuh'ray.shuhn/ (noun)

A pattern where relevant context is retrieved from an external source and included in the prompt before generation. RAG systems have distinct eval needs around retrieval quality and answer groundedness.

Why it matters

RAG systems have two distinct failure modes, and conflating them makes debugging nearly impossible. Retrieval can fail by returning irrelevant or incomplete context, and generation can fail by ignoring, misquoting, or hallucinating beyond the retrieved context. If you only measure end-to-end answer quality, you cannot tell which component is broken. Effective RAG eval scores retrieval quality (did you fetch the right documents?) and generation quality (did you use them faithfully?) independently, then also measures the combined result. This decomposition tells you where to invest effort. Maybe your retriever needs better chunking, or maybe the model needs a stronger grounding prompt. Without separate signals, you end up tuning the wrong knob. Teams building production RAG systems also need to track these metrics on live traffic, because real queries drift from the examples you anticipated during development.
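The decomposition above can be sketched in a few lines. This is a minimal, illustrative example, not a production scorer: the function names, the keyword-overlap groundedness proxy, and the eval case data are all assumptions (a real pipeline would typically use an LLM judge for groundedness and labeled relevance data for retrieval).

```python
# Sketch of decomposed RAG evaluation: score retrieval and generation
# separately instead of assigning one end-to-end grade.
# All names and data below are illustrative assumptions.

def retrieval_recall(retrieved_ids, relevant_ids):
    """Fraction of known-relevant documents that were retrieved."""
    if not relevant_ids:
        return 1.0
    hits = len(set(retrieved_ids) & set(relevant_ids))
    return hits / len(relevant_ids)

def groundedness(answer_sentences, context):
    """Naive proxy: fraction of answer sentences with at least one
    word found in the retrieved context. A real eval would use an
    LLM judge or NLI model instead of keyword overlap."""
    if not answer_sentences:
        return 1.0
    context_lower = context.lower()
    supported = sum(
        1 for s in answer_sentences
        if any(word in context_lower for word in s.lower().split())
    )
    return supported / len(answer_sentences)

# Hypothetical eval case
case = {
    "retrieved_ids": ["doc-3", "doc-7"],
    "relevant_ids": ["doc-3", "doc-9"],
    "context": "The refund window is 30 days from purchase.",
    "answer_sentences": ["Refunds are allowed within 30 days."],
}

scores = {
    "retrieval_recall": retrieval_recall(case["retrieved_ids"], case["relevant_ids"]),
    "groundedness": groundedness(case["answer_sentences"], case["context"]),
}
print(scores)
```

Here a low `retrieval_recall` with a high `groundedness` points at the retriever (chunking, indexing, query rewriting), while the reverse points at the generation prompt. That separation is the knob-tuning signal the paragraph above describes.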

We added RAG so the AI assistant could answer queries using the latest internal docs.

Customer example

Dropbox built its AI search product, Dash, as a RAG system, scoring outputs for grounding and citations, using production traces to see which source objects were retrieved, and feeding failures back into new test sets.

Braintrust is the AI observability and eval platform for production AI. By connecting evals and observability in one workflow, teams at Notion, Stripe, Zapier, Vercel, and Ramp ship quality AI products at scale.
