RAG evaluation

Evaluate your RAG pipeline from retrieval to response

Score retrieval quality and answer faithfulness, improve RAG accuracy with every experiment, and catch regressions before they reach production.

Free to start · No credit card · 5-min setup

Braintrust logs view showing RAG traces with retrieval and generation scores


Evaluate retrieval and generation separately

Braintrust shows you exactly where your pipeline breaks.

1
from braintrust import traced

@traced
def retrieve(query):
  # Fetch context documents from your vector store
  return vector_db.search(query)

@traced
def generate(query, context):
  # Call your LLM with the query and retrieved context
  return llm(query, context)
Trace your pipeline
Wrap retrieval and generation in traced spans. Every query, retrieved document, and generated answer is captured automatically.
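A minimal sketch of that two-stage pipeline, with a stubbed document store and LLM so the flow runs end to end. Everything here (the `DOCS` list, the keyword-overlap ranking, the echo-style `generate`) is illustrative; in a real pipeline `retrieve` and `generate` would call your vector store and model client, each wrapped with `@traced`.

```python
DOCS = [
    "Braintrust traces retrieval and generation separately.",
    "Vector search returns the top-k most similar documents.",
    "LLM answers are scored for faithfulness to the context.",
]

def retrieve(query, k=2):
    # Toy keyword-overlap ranking standing in for vector_db.search
    words = set(query.lower().split())
    ranked = sorted(DOCS, key=lambda d: -len(words & set(d.lower().split())))
    return ranked[:k]

def generate(query, context):
    # Stub LLM: echoes the most relevant context passage
    return f"Based on the docs: {context[0]}"

def rag_pipeline(query):
    context = retrieve(query)
    return generate(query, context)

answer = rag_pipeline("How does vector search work?")
```

Because each stage is a separate function, each stage gets its own span, so a bad answer can be traced back to either bad retrieval or bad generation.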
2
from braintrust import Eval
from autoevals import (
  ContextPrecision,
  ContextRecall,
  Faithfulness,
  AnswerRelevancy,
)

Eval(
  "My RAG",
  data=load_dataset,   # your eval cases
  task=rag_pipeline,   # retrieve + generate
  scores=[
    ContextPrecision,
    ContextRecall,
    Faithfulness,
    AnswerRelevancy,
  ],
)
Score both stages
Retrieval scorers measure whether the right documents were fetched. Generation scorers measure whether the answer is faithful and relevant.
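Conceptually, the two retrieval scorers reduce to precision and recall over retrieved versus relevant documents. The autoevals implementations are LLM-judged; this plain-Python version is only meant to illustrate what each stage measures.

```python
def context_precision(retrieved, relevant):
    # Of the documents we fetched, how many were actually relevant?
    hits = sum(1 for doc in retrieved if doc in relevant)
    return hits / len(retrieved) if retrieved else 0.0

def context_recall(retrieved, relevant):
    # Of the documents we needed, how many did we fetch?
    hits = sum(1 for doc in relevant if doc in retrieved)
    return hits / len(relevant) if relevant else 0.0

retrieved = ["doc_a", "doc_b", "doc_c"]
relevant = ["doc_a", "doc_d"]
precision = context_precision(retrieved, relevant)  # 1/3
recall = context_recall(retrieved, relevant)        # 1/2
```

Low precision with high recall suggests you are fetching too many irrelevant documents; low recall means the right documents never reach the generator, so no generation fix will help.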
3
- name: RAG eval gate
  uses: braintrustdata/eval-action@v1
  with:
    api_key: ${{ secrets.BRAINTRUST_API_KEY }}
    fail_on_regression: true
Fix and gate releases
Drill into failing queries, see exactly which documents were retrieved, and fix the right thing. Gate deploys in CI so retrieval regressions never reach production.

What changes when Braintrust is part of your workflow

10x faster issue resolution

<10 min eval turnaround

25% accuracy improvement

45x more feedback

Notion, Dropbox, Zapier, and Coursera use Braintrust to ship better AI. Get started free →

Works how your team works

Braintrust scores both the retrieval and generation stages, so you know exactly which one to fix.

For engineers

Braintrust logs view with trace tree and RAG eval scores

Build custom datasets from existing production logs. Understand whether you’re using the right embedding model, and weigh cost and token usage against accuracy.

For PMs & domain experts

Braintrust human review panel showing RAG scores and trace tree

Retrieval and generation scores in one view. See whether a drop in answer quality comes from bad retrieval or bad generation.

Built for evals from the start

Find patterns without reading every span

Use Loop to synthesize a starting dataset and surface patterns in your traces, like hallucinations or vector retrieval issues, without reading every span yourself.

20+ scorers, ready to use

Faithfulness, ContextPrecision, ContextRecall, AnswerRelevancy, and more via autoevals. Or write your own. No scorer infrastructure to build or maintain.
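As a sketch of "write your own": a custom scorer is just a function that returns a value between 0 and 1. The name `citation_overlap` and the `context` argument here are hypothetical, and the heuristic is deliberately simple.

```python
def citation_overlap(output, context, **kwargs):
    # Heuristic scorer: fraction of answer tokens that appear somewhere
    # in the retrieved context -- a cheap proxy for groundedness.
    answer_words = set(output.lower().split())
    context_words = set(" ".join(context).lower().split())
    if not answer_words:
        return 0.0
    return len(answer_words & context_words) / len(answer_words)

score = citation_overlap(
    output="vector search finds similar documents",
    context=["vector search returns the most similar documents"],
)  # 4 of 5 answer words appear in context -> 0.8
```

Cheap heuristic scorers like this run on every trace; LLM-judged scorers like Faithfulness can then be reserved for the cases that look suspicious.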

Production failures become RAG test cases

When a query returns a bad answer in production, tag it and it lands in your golden dataset. Your RAG testing suite grows from real failures.
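A sketch of that flow, assuming a hypothetical shape for the production log record. In Braintrust itself you would tag the log in the UI or via the SDK; this just shows how a failure naturally maps onto a dataset row.

```python
def log_to_test_case(log_record):
    # Convert a tagged production failure into a golden-dataset row:
    # the user's query becomes the input, while the retrieved documents
    # and the bad answer are kept as metadata so the fix can be verified.
    return {
        "input": log_record["query"],
        "expected": None,  # filled in during human review
        "metadata": {
            "retrieved_docs": log_record["retrieved_docs"],
            "bad_answer": log_record["answer"],
            "tag": "production-failure",
        },
    }

case = log_to_test_case({
    "query": "What is our refund policy?",
    "retrieved_docs": ["shipping-faq.md"],
    "answer": "Refunds are not supported.",
})
```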

What our customers say

“There are some problems we wouldn't know were problems without Braintrust.”

Sarah Sachs, AI Lead at Notion

Stop shipping on vibes

Set up your first eval in minutes

Free to start · No credit card required

Get started free