
RAG Evaluation Metrics: How to evaluate your RAG pipeline with Braintrust

5 November 2025 · Braintrust Team

Retrieval-augmented generation (RAG) systems ground LLM responses in your data. The theory sounds simple: retrieve relevant documents, pass them to your language model, generate accurate answers. In practice, RAG pipelines fail subtly. Retrieval surfaces irrelevant documents. Models hallucinate despite correct context. Answers are factually accurate but miss the question.

Production reveals the challenge. Users ask questions your retrieval never saw. Edge cases multiply. Performance degrades silently. Without systematic evaluation, you discover problems only when users lose trust.

Why RAG evaluation differs from standard LLM evaluation

Standard LLM evaluation focuses on output quality: accuracy, relevance, tone. RAG evaluation measures the entire pipeline. Retrieval quality determines whether your model has the information to answer. Context utilization determines whether models use retrieved documents. Answer grounding determines whether responses stay faithful to sources.

Each component fails independently. Perfect retrieval fails if models ignore context. Perfect generation fails if retrieval surfaced wrong documents. Evaluating RAG requires measuring both retrieval and generation, understanding their interaction, and tracking quality through the pipeline.

Core RAG evaluation metrics

Answer relevancy

Answer relevancy measures whether your response actually addresses the user's question. A factually accurate answer that discusses the wrong topic scores poorly on relevancy.

Consider a user asking "What is our return policy for electronics?" If your system responds with accurate information about clothing returns, the answer fails relevancy despite being factually correct. The response needs to match both the topic (electronics) and the intent (return policy).

Measure relevancy by comparing questions to answers. Use LLM-as-a-judge scorers that evaluate whether answers address the specific question asked. Include edge cases where questions contain ambiguity or multiple sub-questions.

Braintrust makes answer relevancy scoring straightforward through built-in scorers that compare semantic similarity between questions and answers, or through custom LLM judges that evaluate relevancy according to your specific criteria.
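
As an illustration, the sketch below implements a minimal LLM-as-a-judge relevancy scorer using the scorer-function signature Braintrust evals accept. The OpenAI client, model choice, and rubric are assumptions; adapt the prompt to your own relevancy criteria.

python
from openai import OpenAI

client = OpenAI()

RELEVANCY_PROMPT = """Question: {question}

Answer: {answer}

Does the answer address the topic and intent of the question?
Reply with exactly one word: relevant or irrelevant."""


def answer_relevancy(input, output, expected=None):
    # LLM judge: does the generated answer address the question asked?
    # Model and rubric are illustrative; tune both for your domain.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": RELEVANCY_PROMPT.format(question=input, answer=output)}],
    )
    verdict = response.choices[0].message.content.strip().lower()
    return 0.0 if verdict.startswith("irrelevant") else 1.0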

Faithfulness and groundedness

Faithfulness measures whether answers stay true to retrieved context. Hallucination in RAG systems defeats the entire purpose of retrieval augmentation. If your model invents facts despite having correct source documents, users cannot trust your system.

Groundedness evaluation verifies that every claim in your answer can be traced back to retrieved context. This requires checking individual statements against source documents. A response might be 90% grounded with one hallucinated sentence that undermines trust completely.

Test faithfulness by comparing generated answers against retrieved documents. Look for claims that appear in answers but not in source material. Track whether models tend to hallucinate specific types of information (dates, numbers, names).

Implementation approaches include fact verification scorers that check each claim against source documents, consistency scorers that flag contradictions between answers and context, and attribution scorers that verify all statements have document support.

Context precision

Context precision measures whether retrieved documents actually contain information needed to answer the question. High precision means most retrieved documents prove useful. Low precision means your retrieval system surfaces many irrelevant documents.

Irrelevant context wastes token budget and confuses models. When half your retrieved documents discuss unrelated topics, your model must sort through noise to find signal. This increases latency, costs, and error rates.

Evaluate context precision by having judges assess each retrieved document for relevance to the question. Calculate the ratio of relevant to total retrieved documents. Track precision across different query types to identify where retrieval struggles.
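
As a rough sketch, precision is the fraction of retrieved documents an LLM judge marks relevant; `judge_relevance` below is a hypothetical placeholder for that judge call.

python
def context_precision(question: str, retrieved_docs: list[str]) -> float:
    """Fraction of retrieved documents judged relevant to the question."""
    if not retrieved_docs:
        return 0.0
    # judge_relevance is a placeholder for an LLM-as-a-judge call that
    # returns True when the document helps answer the question.
    relevant = [doc for doc in retrieved_docs if judge_relevance(question, doc)]
    return len(relevant) / len(retrieved_docs)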

Context recall

Context recall measures whether your retrieval captures all relevant information available in your knowledge base. High recall means you retrieve all useful documents. Low recall means relevant information gets missed.

The tradeoff between precision and recall defines RAG system design. Retrieving 20 documents increases recall but decreases precision. Retrieving 3 documents increases precision but risks missing relevant information. Optimal balance depends on your use case.

Measure recall by manually identifying all relevant documents for test queries, then checking what percentage your system retrieves. This requires building golden datasets where you know the correct documents for each question.
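
For example, if your golden dataset maps each query to the set of document IDs you know are relevant, recall is simply the fraction of those IDs the retriever returns (a minimal sketch):

python
def context_recall(golden_doc_ids: set[str], retrieved_doc_ids: list[str]) -> float:
    """Fraction of known-relevant documents that retrieval actually returned."""
    if not golden_doc_ids:
        return 1.0  # nothing relevant exists for this query
    found = golden_doc_ids.intersection(retrieved_doc_ids)
    return len(found) / len(golden_doc_ids)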

Answer similarity and correctness

For questions with known correct answers, measure how closely generated responses match expected answers. This works well for factual questions with objective answers.

Use semantic similarity metrics to compare generated and reference answers. Exact string matching fails because multiple phrasings can be correct. Semantic similarity captures whether answers convey the same information even with different wording.

Combine similarity scores with correctness evaluation. Similar answers might both be wrong. Correctness scoring verifies factual accuracy against ground truth, while similarity measures phrasing alignment.
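
One way to compute this, sketched below, is to embed both answers and take the cosine similarity; the OpenAI embedding model named here is an illustrative choice.

python
import numpy as np
from openai import OpenAI

client = OpenAI()


def answer_similarity(generated: str, reference: str) -> float:
    """Cosine similarity between embeddings of the generated and reference answers."""
    result = client.embeddings.create(
        model="text-embedding-3-small",  # illustrative embedding model
        input=[generated, reference],
    )
    a = np.array(result.data[0].embedding)
    b = np.array(result.data[1].embedding)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))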

Retrieval precision and recall metrics

Beyond document-level context metrics, measure retrieval system performance directly. Track precision at K (what percentage of top-K results are relevant), mean average precision (MAP) across queries, and normalized discounted cumulative gain (NDCG) to account for ranking quality.

These metrics help diagnose retrieval problems separate from generation issues. If retrieval precision is high but answer quality is low, your generation prompt needs work. If retrieval precision is low, your embedding model or chunking strategy needs improvement.
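
The helpers below compute precision@K and NDCG from binary relevance labels (1 for relevant, 0 for not), assuming you have per-document judgments for each query:

python
import math


def precision_at_k(relevance: list[int], k: int) -> float:
    """Fraction of the top-k retrieved documents labeled relevant."""
    return sum(relevance[:k]) / k if k else 0.0


def ndcg_at_k(relevance: list[int], k: int) -> float:
    """Normalized discounted cumulative gain over the top-k results."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevance[:k]))
    ideal = sorted(relevance, reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0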

Building a RAG evaluation framework with Braintrust

Trace the entire pipeline

RAG evaluation requires visibility into every step. Braintrust traces capture this naturally. Each pipeline step becomes a span with inputs, outputs, and metadata. When evaluation scores are low, drill into traces to see exactly where the pipeline failed.

In the code below, we use the @traced decorator to instrument each component of the RAG pipeline: retrieval, generation, and the overall orchestration. This creates a hierarchical trace structure that logs queries, retrieved documents with relevance scores, context passed to the model, and generated answers.

python
from braintrust import traced


@traced
def retrieve_documents(query: str, top_k: int = 5):
    # Your retrieval logic
    documents = search_index(query, top_k=top_k)
    return documents


@traced
def generate_answer(query: str, documents: list[str]):
    # Your generation logic
    context = "\n".join(documents)
    answer = llm.generate(query=query, context=context)
    return answer


@traced
def rag_pipeline(query: str, top_k: int = 5):
    documents = retrieve_documents(query, top_k=top_k)
    answer = generate_answer(query, documents)
    return {"query": query, "documents": documents, "answer": answer}

This structure separates retrieval and generation into distinct spans. When debugging low scores, you can pinpoint whether retrieval surfaced wrong documents or generation misused correct context.

Create evaluation datasets from production

The best RAG evaluation data comes from real user queries. Production reveals question patterns you never anticipated. Users phrase questions differently than developers expect. Edge cases emerge organically. By collecting production data, you build a golden dataset of real user queries that become the foundation for measuring improvements.

Start collecting production queries immediately with Braintrust. Tag interesting cases: successful answers, failures, ambiguous questions, and edge cases. This golden dataset becomes your regression test suite and quality benchmark.

Include diverse query types. Simple factual questions with clear answers. Complex questions requiring synthesis across documents. Ambiguous questions where multiple interpretations are valid. Questions that should return "I don't know" when information is unavailable.

Define scorers for each metric

RAG evaluation requires multiple scorers covering different quality dimensions. Braintrust supports several scorer types optimized for RAG pipelines:

Answer relevancy scorers measure whether responses address the user's question. Use semantic similarity metrics to compare questions and answers, or configure LLM-as-a-judge scorers with rubrics defining what constitutes a relevant answer. These catch cases where factually correct answers discuss the wrong topic.

Faithfulness and groundedness scorers verify that answers stay true to retrieved context. LLM-as-a-judge scorers can extract claims from answers and check each against source documents. Track the ratio of supported to total claims. These scorers catch hallucination despite having correct context.

Context quality scorers evaluate retrieval effectiveness. Measure whether retrieved documents contain information needed to answer the question. Track precision (percentage of relevant docs) and recall (percentage of available relevant docs retrieved). Use both automated scoring and human labels for your golden dataset.

Retrieval ranking scorers assess whether the most relevant documents appear first in results. Implement NDCG (normalized discounted cumulative gain) or MAP (mean average precision) to measure ranking quality. These metrics matter because models may not fully utilize lower-ranked documents.

Completeness scorers check whether answers include all necessary information. For factual questions, verify all required details appear. For procedural questions, ensure all steps are covered. Use LLM-as-a-judge with detailed rubrics defining completeness for your domain.

The code below demonstrates a faithfulness scorer that verifies answer claims against retrieved context. We create a scorer function that extracts the generated answer and retrieved documents from the pipeline output, then uses an LLM to verify that each claim in the answer appears in the source documents:

python
def faithfulness_scorer(input, output, expected=None):
    """Answer Faithfulness: checks whether answer claims are supported by retrieved context."""
    answer = output.get("answer", "")
    documents = output.get("documents", [])

    # Use an LLM to verify each claim in the answer against the documents.
    # llm.verify_claims is a placeholder for your own claim-verification helper.
    verification = llm.verify_claims(answer=answer, documents=documents)

    return {
        "score": verification.faithfulness_score,
        "metadata": {
            "unsupported_claims": verification.unsupported_claims,
            "total_claims": verification.total_claims,
        },
    }

This faithfulness scorer returns a score between 0 and 1 representing the percentage of claims supported by context. The metadata includes unsupported claims for debugging. When faithfulness scores drop in production, drill into the metadata to see which specific claims lack support.

Implement ToolRAG for enhanced retrieval

Standard RAG retrieves documents and generates answers. ToolRAG treats retrieval as a tool the model can call dynamically. This allows models to decide when retrieval is needed, formulate better queries, and iterate on search results.

The Braintrust ToolRAG cookbook demonstrates implementing retrieval as a function tool. Models invoke the retrieval function with query strings, receive documents, and can call again with refined queries if initial results prove insufficient.

ToolRAG improves evaluation by making retrieval decisions explicit. You can measure when models choose to retrieve, whether they formulate effective queries, and how they use retrieved information. This visibility helps diagnose whether poor performance stems from retrieval strategy or document usage.

Run experiments to compare pipeline variations

RAG pipelines involve many design choices. Embedding models determine retrieval quality. Chunk size affects context precision. Number of documents retrieved balances precision and recall. Prompts guide how models use context.

Test these choices systematically through experiments. Change one variable, run evaluation on your dataset, compare results. This iterative process surfaces which components actually impact quality.

Braintrust experiments make comparison straightforward. Define your evaluation dataset, run variants with different configurations, and view score distributions side by side.

Below, we compare two retrieval strategies: a baseline retrieving 5 documents versus an experiment retrieving 10 documents. Both run against the same evaluation dataset and apply the same scorers (relevancy, faithfulness, and context precision):

python
from braintrust import Eval

# Baseline with 5 documents
baseline_results = Eval(
    "RAG Pipeline",  # project name
    experiment_name="RAG Baseline - 5 docs",
    data=evaluation_dataset,
    task=lambda query: rag_pipeline(query, top_k=5),
    scores=[relevancy_scorer, faithfulness_scorer, context_precision_scorer],
)

# Experiment with 10 documents
experiment_results = Eval(
    "RAG Pipeline",
    experiment_name="RAG Experiment - 10 docs",
    data=evaluation_dataset,
    task=lambda query: rag_pipeline(query, top_k=10),
    scores=[relevancy_scorer, faithfulness_scorer, context_precision_scorer],
)

Braintrust shows score distributions for both experiments side by side. Review whether more documents improved context recall without degrading precision. Drill into specific examples where performance changed to understand the tradeoffs.

Monitor RAG quality in production

Offline evaluation catches many issues before deployment. Production reveals problems you did not anticipate. Monitor key metrics continuously to detect degradation.

Track answer relevancy scores on live requests, faithfulness scores to catch hallucination, retrieval latency and error rates, user feedback signals (thumbs up/down), and escalation rates to human support.

Set alerts for metric degradation. If faithfulness scores drop suddenly, investigate whether retrieved documents changed or model behavior shifted. If retrieval latency spikes, check if your vector database is overloaded.

Braintrust online evaluation supports continuous monitoring. Define scorers that run on production traffic, sample based on volume, and alert when quality degrades. Feed low-scoring examples back into offline datasets for systematic improvement.
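
One client-side approach, sketched below, samples a fraction of live requests, runs the faithfulness scorer from earlier, and attaches the score to the request's trace. The project name and sample rate are assumptions, and Braintrust's online scoring can also run scorers server-side without a wrapper like this.

python
import random

from braintrust import current_span, init_logger, traced

logger = init_logger(project="RAG Production")  # illustrative project name

SAMPLE_RATE = 0.1  # score roughly 10% of live traffic


@traced
def answer_query(query: str):
    result = rag_pipeline(query)
    # Run scorers on a sample of requests and attach scores to the trace.
    if random.random() < SAMPLE_RATE:
        faithfulness = faithfulness_scorer(input=query, output=result)
        current_span().log(scores={"faithfulness": faithfulness["score"]})
    return result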

Practical RAG evaluation strategies

Start with golden datasets

Before building comprehensive evaluation pipelines, create a small set of high-quality examples representing your most important use cases. In Braintrust, build your golden dataset by curating production traces or creating representative examples. These datasets serve as regression tests and quality benchmarks.

Include successful query-answer pairs that should always work, known failure cases you fixed, questions requiring multi-document synthesis, and questions that should return "insufficient information" when data is unavailable.

Run every pipeline change against your golden dataset in Braintrust. If a new embedding model breaks previously working queries, you catch it before deployment. If a prompt modification improves scores, that signals readiness for broader testing.
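
A minimal sketch of this workflow with the Braintrust SDK: insert curated examples into a dataset, then point every regression run at that same dataset. The project name, dataset name, and example content are illustrative.

python
from braintrust import Eval, init_dataset

# Create (or append to) a golden dataset.
golden = init_dataset(project="RAG Pipeline", name="Golden questions")
golden.insert(
    input="What is our return policy for electronics?",
    expected="Electronics can be returned within 30 days with proof of purchase.",
    metadata={"source": "production", "tags": ["returns", "regression"]},
)

# Run every pipeline change against the same golden dataset.
Eval(
    "RAG Pipeline",
    experiment_name="Golden regression - candidate change",
    data=golden,
    task=rag_pipeline,
    scores=[relevancy_scorer, faithfulness_scorer],
)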

Evaluate retrieval and generation separately

RAG pipelines combine two complex systems. When end-to-end evaluation shows poor performance, separate evaluation pinpoints the problem. In Braintrust, you can evaluate retrieval and generation independently by examining individual spans in your traces.

Test retrieval in isolation by evaluating whether correct documents appear in retrieved results, checking relevance scores and rankings, and measuring retrieval latency and consistency. Create scorers that run only on the retrieval span.

Test generation separately by providing known good context and evaluating answer quality, checking whether models use all provided information, and measuring how performance degrades with more/less context. Create scorers that focus on the generation span.

This separation in Braintrust accelerates debugging. Low answer quality despite high retrieval quality means your prompt or model needs adjustment. Low retrieval quality means your embedding model or indexing strategy needs work.
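
As a sketch of the generation-side test, the eval below calls only the generation step and feeds it known-good documents bundled into each dataset item; `generation_only_dataset` and the project name are hypothetical.

python
from braintrust import Eval

# Each item's input bundles the query with curated, known-good context documents.
Eval(
    "RAG Pipeline",  # illustrative project name
    experiment_name="Generation only - golden context",
    data=generation_only_dataset,
    task=lambda item: {
        "answer": generate_answer(item["query"], item["documents"]),
        "documents": item["documents"],
    },
    scores=[relevancy_scorer, faithfulness_scorer],
)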

Use reference-free and reference-based metrics together

Reference-free metrics evaluate quality without ground truth answers. LLM-as-a-judge scores, semantic similarity to questions, and faithfulness to context all work without labeled data.

Reference-based metrics compare against known correct answers. They provide objective measurements but require building labeled datasets.

Combine both approaches. Use reference-free metrics for rapid iteration and broad coverage. Use reference-based metrics for precise measurement on critical queries. This balance provides both speed and accuracy.

Build query type taxonomies

RAG systems handle diverse question types. Simple factual lookup questions differ from complex analytical questions. Both differ from navigational questions seeking specific documents.

In Braintrust, categorize your evaluation queries using metadata tags. Tag each question with characteristics like: factual vs analytical, single-document vs multi-document, objective vs subjective answer, and common vs rare query pattern.

Evaluate performance across categories using Braintrust's filtering capabilities. If your system excels at factual questions but struggles with analytical questions, that reveals where to focus improvement. If rare queries perform poorly, that suggests retrieval coverage gaps.
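
For example, dataset items can carry these tags in their metadata, which you can later filter on in the Braintrust UI; the tag values and example content below are illustrative.

python
evaluation_dataset = [
    {
        "input": "How do I authenticate API requests?",
        "expected": "Use a bearer token in the Authorization header.",
        "metadata": {"query_type": "factual", "scope": "single-document", "frequency": "common"},
    },
    {
        "input": "How has error-handling guidance changed across API versions?",
        "expected": None,  # open-ended; scored reference-free
        "metadata": {"query_type": "analytical", "scope": "multi-document", "frequency": "rare"},
    },
]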

Test retrieval edge cases

Production queries challenge RAG systems in ways development queries never do. Users make typos. They use domain terminology inconsistently. They ask about topics tangentially related to your knowledge base.

Build evaluation datasets that stress test retrieval: queries with typos and misspellings, queries using synonyms or alternative terminology, queries about topics not directly covered, and queries that should return "no information available."

These edge cases reveal system robustness. RAG pipelines that gracefully handle uncertainty and missing information build more user trust than systems that confidently hallucinate.

Advanced RAG evaluation patterns

Measure context utilization

Retrieving relevant documents means nothing if your model ignores them. Context utilization measures whether generated answers actually incorporate information from retrieved documents.

Check whether key facts from context appear in answers. Track which documents in the retrieved set get used. Identify patterns where models consistently ignore specific document types or positions.

Implementation approaches include salience scoring that tracks which context portions appear in answers, document attribution that identifies which retrieved docs contributed to the response, and position bias analysis that reveals whether models favor certain context positions.
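
As a crude lexical proxy, the sketch below measures token overlap between the answer and each retrieved document to estimate which documents (and positions) the model actually drew on; an LLM-based attribution scorer is more robust but slower.

python
def context_utilization(answer: str, documents: list[str]) -> dict:
    """Lexical estimate of which retrieved documents the answer draws on."""
    answer_tokens = set(answer.lower().split())
    usage = []
    for position, doc in enumerate(documents):
        doc_tokens = set(doc.lower().split())
        overlap = len(answer_tokens & doc_tokens) / max(len(doc_tokens), 1)
        usage.append({"position": position, "overlap": round(overlap, 3)})
    # Count documents with more than 10% token overlap as "used" (threshold is arbitrary).
    return {"per_document": usage, "documents_used": sum(u["overlap"] > 0.1 for u in usage)}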

Evaluate query reformulation

Many RAG systems reformulate user queries before retrieval. A question like "How do I return stuff?" might reformulate to "return policy process" for better retrieval.

Evaluate whether reformulations improve retrieval quality. Log original queries, reformulated queries, and retrieval results for both. Measure whether reformulation increases relevant document retrieval.

Test reformulation robustness. Some queries benefit from expansion while others need narrowing. Identify patterns in when reformulation helps versus hurts.
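
One way to make this concrete, sketched below, is to trace the reformulation step, retrieve with both the original and rewritten query, and compare recall against your golden documents. The `rewrite_with_llm` helper is a placeholder, retrieved documents are assumed to expose an `id`, and `context_recall` refers to the helper sketched earlier.

python
from braintrust import current_span, traced


@traced
def reformulate_query(query: str) -> str:
    # Placeholder for your query-rewriting prompt or model call.
    return rewrite_with_llm(query)


@traced
def compare_reformulation(query: str, golden_doc_ids: set[str], top_k: int = 5):
    rewritten = reformulate_query(query)
    # Assumes retrieved documents expose an `id`; adapt to your index's schema.
    original_ids = [doc.id for doc in retrieve_documents(query, top_k=top_k)]
    rewritten_ids = [doc.id for doc in retrieve_documents(rewritten, top_k=top_k)]
    current_span().log(
        metadata={"original_query": query, "rewritten_query": rewritten},
        scores={
            "recall_original": context_recall(golden_doc_ids, original_ids),
            "recall_rewritten": context_recall(golden_doc_ids, rewritten_ids),
        },
    )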

Test multi-turn conversation

RAG in conversation requires managing context across turns. Users ask follow-up questions expecting the system to remember previous exchanges. "What about electronics?" only makes sense after discussing returns.

Evaluate conversation-aware RAG by creating multi-turn test scenarios, checking whether context carries across turns correctly, and measuring when systems lose track of conversation state.

This reveals whether your RAG pipeline handles pronouns and references correctly, maintains topic consistency across turns, and knows when to retrieve new information versus using conversation history.

Measure chunk boundary effects

Document chunking determines retrieval granularity. Chunk size and overlap affect whether relevant information appears in retrieved results.

Test how chunking impacts quality. Create evaluation scenarios where correct information spans chunk boundaries. Measure whether your retrieval strategy surfaces both chunks or misses the connection.

Experiment with chunk sizes and overlap percentages. Smaller chunks increase retrieval precision but risk splitting related information. Larger chunks ensure completeness but reduce precision. Optimal size depends on your documents and queries.
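
As a concrete starting point, here is a minimal fixed-size character chunker with overlap; sweep `chunk_size` and `overlap` in experiments to see how boundary placement affects retrieval.

python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character chunks with overlap between neighbors."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[start : start + chunk_size] for start in range(0, len(text), step)]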

RAG evaluation with Braintrust

Automatic trace capture for RAG pipelines

Every RAG request generates rich execution data. Braintrust automatically captures query text and any preprocessing, retrieved documents with relevance scores, context sent to the generation model, generated response, and timing information for each step.

This trace data enables detailed evaluation. When answer relevancy scores are low, inspect whether retrieval surfaced wrong documents or generation misused correct context. When faithfulness scores drop, see exactly which context was available and what claims appeared unsupported.

Built-in RAG scorers

Braintrust provides RAG-specific scorers out of the box. Use semantic similarity to measure answer-question alignment, factual consistency for faithfulness checking, and context relevance for retrieval quality.

Combine built-in scorers with custom scorers for domain-specific requirements. Financial RAG systems might need number extraction accuracy scorers. Medical RAG might require clinical terminology consistency checks.

Experiment comparison for retrieval optimization

RAG improvement requires systematic comparison. Test different embedding models, compare chunking strategies, evaluate retrieval algorithm variants, and optimize context window usage.

Braintrust experiment views show score distributions across all variants. Filter by query type to understand performance patterns. Drill into specific examples where variants produced different results.

This comparison workflow turns RAG optimization from guesswork into measurement. Instead of wondering whether a change helped, you see exactly which queries improved and which regressed.

Production monitoring with online evaluation

RAG quality degrades silently in production. Document updates change retrieval behavior. User query patterns shift. Model updates affect generation quality.

Monitor production RAG with continuous evaluation. Sample live requests based on volume. Run scorers on samples to track quality trends. Alert when metrics fall below thresholds.

Feed problematic production queries back into offline datasets. This creates a continuous improvement loop where production failures become test cases, driving systematic fixes.

Dataset curation from traces

The best evaluation data comes from real usage. Braintrust makes it easy to build datasets from production traces.

Browse production traces and add interesting cases to datasets. Filter by score ranges to find edge cases. Tag queries by topic or difficulty. Export datasets for offline analysis.

This tight integration between production logging and evaluation ensures your tests reflect actual usage patterns rather than synthetic scenarios.

Case study: Evaluating a documentation RAG system

Consider a RAG system for answering questions about API documentation. Users ask how to use endpoints, what parameters are required, and how to handle errors.

In Braintrust, create a golden dataset of 50 common questions with known correct answers. Include simple queries like "How do I authenticate?" and complex queries like "What's the difference between sync and async endpoints?"

Define scorers measuring answer correctness, context precision (do retrieved docs contain API info?), and completeness (does answer include all required parameters?).

Run baseline evaluation in Braintrust. Identify failure patterns: retrieval misses relevant docs for certain query types, generation invents parameter names, answers omit error handling.

Run Braintrust experiments testing improvements: a better embedding model for code snippets, increased document count from 5 to 8, updated generation prompt emphasizing completeness.

Braintrust shows the new embedding model improved retrieval precision by 15%. More documents improved completeness with minimal precision loss. The prompt update reduced invented parameters.

Deploy with Braintrust's online monitoring. Track answer correctness and faithfulness scores. Alert when scores drop below baseline. After two weeks, evaluate performance on new production queries.

Feed low-scoring queries from Braintrust traces back into your golden dataset. Iterate on failures. This continuous loop drives steady improvement.

Common RAG evaluation pitfalls

Evaluating only end-to-end quality

Testing complete RAG pipelines without measuring individual components makes debugging difficult. When quality drops, you need to know whether retrieval, generation, or their interaction caused the problem.

Always evaluate retrieval and generation separately alongside end-to-end metrics. This separation enables precise diagnosis and targeted improvements.

Using only synthetic evaluation data

Synthetic questions help ensure coverage but miss real usage patterns. Users phrase questions in unexpected ways. Production reveals edge cases you never anticipated.

Build evaluation datasets from production queries as soon as possible. Synthetic data provides initial coverage, but real queries drive realistic evaluation.

Ignoring retrieval ranking quality

Focusing only on whether relevant documents appear in results misses ranking quality. If the best document appears at position 10, your model might not see it due to context length limits.

Evaluate ranking with position-weighted metrics like NDCG. Measure performance at different K values to understand precision-recall tradeoffs.

Testing only common queries

Evaluating frequent query patterns shows average-case performance but misses edge cases that frustrate users. Rare queries often surface retrieval gaps and generation brittleness.

Actively include edge cases in evaluation datasets. Test typos, unusual phrasings, tangential questions, and queries that should return "no information available."

Neglecting latency evaluation

RAG systems face strict latency requirements. Users tolerate maybe 2-3 seconds for answers. Retrieval plus generation easily exceeds this budget without optimization.

Include latency metrics in every evaluation. Measure p50, p95, and p99 latencies. Set acceptable thresholds and fail tests that exceed them. Profile your pipeline to identify bottlenecks early.
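
These percentiles are easy to compute directly from per-request latencies captured in your traces (a minimal sketch using the standard library):

python
import statistics


def latency_percentiles(latencies_ms: list[float]) -> dict[str, float]:
    """p50, p95, and p99 latencies from per-request measurements in milliseconds."""
    cuts = statistics.quantiles(latencies_ms, n=100)  # 99 cut points at 1%..99%
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}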

Assuming faithfulness from high relevancy

Relevant answers can still hallucinate. Models sometimes combine retrieved information with outside knowledge, creating plausible but unsupported claims.

Always evaluate faithfulness separately from relevancy. Verify that answer claims appear in retrieved context. Track hallucination rates continuously.

Continuous RAG improvement through evaluation

RAG quality requires ongoing attention. Document updates change retrieval behavior. User needs evolve. New failure patterns emerge. Systematic evaluation enables continuous improvement.

Build evaluation into your development workflow. Run offline evaluation on every pipeline change. Review experiment results before deploying updates. Monitor production metrics continuously. Feed production failures back into offline datasets.

This evaluation loop compounds improvements over time. Production traces reveal real failures. Offline evaluation catches regressions. Experiments compare alternatives systematically. Each iteration makes your RAG system measurably better.

Getting started with RAG evaluation

Start small and build systematically. Begin by instrumenting your pipeline with tracing to capture queries, retrieved documents, and generated answers. Create an initial evaluation dataset of 30-50 real or realistic queries covering your main use cases. Define scorers for answer relevancy, faithfulness, and context precision.

Run your first offline evaluation comparing your current pipeline against a variation (different top-K, updated prompt, or new embedding model). Deploy the winner with production monitoring. Feed low-scoring queries back into your dataset. Repeat.

As your evaluation matures, expand coverage. Add more scorers for specific quality dimensions. Build query-type taxonomies to understand performance patterns. Test edge cases systematically. Automate regression detection. Focus on continuous measurement-driven improvement.

Resources and next steps

Frequently asked questions

What are the most important RAG evaluation metrics?

The core RAG metrics are answer relevancy (does the answer address the question?), faithfulness (does the answer stay true to retrieved documents?), context precision (are retrieved documents relevant?), and context recall (did retrieval find all relevant information?). Start with these four and add domain-specific metrics as needed.

How do I evaluate RAG systems without labeled data?

In Braintrust, configure LLM-as-a-judge scorers to evaluate relevancy and faithfulness without ground-truth labels. Use semantic similarity metrics to compare questions and answers, and build consistency scorers that verify answers align with retrieved context. These reference-free approaches enable immediate evaluation.

What's the difference between faithfulness and groundedness?

These terms are often used interchangeably. Both measure whether generated answers stay true to retrieved context without hallucinating. Faithfulness typically refers to overall answer truthfulness, while groundedness specifically checks that individual claims have document support.

How many documents should I retrieve for RAG?

Optimal retrieval count depends on your use case. Fewer documents (3-5) increase precision and reduce costs but risk missing information. More documents (10-20) increase recall but add noise and latency. Use Braintrust experiments to test different retrieval counts and measure precision-recall tradeoffs on your evaluation dataset.

How do I evaluate RAG retrieval quality separately from generation?

In Braintrust, evaluate retrieval by examining the retrieval span independently. Score whether retrieved documents contain information needed to answer questions using LLM-as-a-judge scorers. Measure precision (percentage of relevant retrieved docs) and recall (percentage of available relevant docs retrieved) to identify retrieval issues distinct from generation problems.

What evaluation metrics work for conversational RAG?

Conversational RAG requires additional metrics beyond single-turn evaluation. Measure context retention across turns, pronoun and reference resolution accuracy, and topic coherence throughout conversations. Track whether the system knows when to retrieve new information versus using conversation history.

How often should I run RAG evaluations?

Run offline evaluations in Braintrust on every code change before deployment. Enable online evaluation to monitor production continuously with sampled live traffic. Review results weekly to identify trends. Expand your golden dataset monthly with new production edge cases captured in Braintrust traces.