Evaluate your RAG pipeline from retrieval to response
Score retrieval quality and answer faithfulness, improve RAG accuracy with every experiment, and catch regressions before they reach production.
Free to start · No credit card · 5-min setup

Evaluate retrieval and generation separately
Braintrust shows you exactly where your pipeline breaks.
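from braintrust import traced

# vector_db and llm are placeholders for your own vector store client and model call.
# Each @traced function is recorded as its own span.
@traced
def retrieve(query):
    return vector_db.search(query)

@traced
def generate(query, context):
    return llm(query, context)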
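from braintrust import Eval
from autoevals import (
    ContextPrecision,
    ContextRecall,
    Faithfulness,
    AnswerRelevancy,
)

Eval(
    "My RAG",
    # Eval also needs data= (your test cases) and task= (your pipeline); omitted here for brevity.
    scores=[
        ContextPrecision,
        ContextRecall,
        Faithfulness,
        AnswerRelevancy,
    ],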
)

- name: RAG eval gate
  uses: braintrustdata/eval-action@v1
  with:
    api_key: ${{ secrets.BRAINTRUST_API_KEY }}
    fail_on_regression: true

What changes when Braintrust is part of your workflow
10x faster issue resolution
<10 min eval turnaround
25% accuracy improvement
45x more feedback
Notion, Dropbox, Zapier, and Coursera use Braintrust to ship better AI.
Works how your team works
Braintrust measures the retrieval and generation stages separately, so you know exactly which one to fix.
For engineers

Build custom datasets from existing production logs. Check whether you’re using the right embedding model, and weigh cost and token usage against accuracy.
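One way to frame the embedding-model question is as a constrained choice: among the configurations you have already evaluated, take the cheapest one that clears an accuracy floor. The sketch below assumes you have a per-experiment rollup of accuracy and average cost per query; the `ExperimentSummary` shape and the `pick_embedding_model` helper are hypothetical and not part of the Braintrust SDK.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ExperimentSummary:
    """Hypothetical per-experiment rollup exported from your eval runs."""
    embedding_model: str
    accuracy: float        # e.g. mean answer score in [0, 1]
    cost_per_query: float  # e.g. average cost per query, embedding plus LLM tokens


def pick_embedding_model(
    summaries: Sequence[ExperimentSummary], min_accuracy: float
) -> Optional[ExperimentSummary]:
    """Return the cheapest configuration whose accuracy meets min_accuracy.

    Ties on cost go to the more accurate configuration.
    Returns None if no configuration clears the floor.
    """
    eligible = [s for s in summaries if s.accuracy >= min_accuracy]
    if not eligible:
        return None
    # Sort key: lowest cost first, then highest accuracy.
    return min(eligible, key=lambda s: (s.cost_per_query, -s.accuracy))
```

Raising `min_accuracy` trades cost for quality explicitly, which keeps the decision visible instead of buried in a spreadsheet.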
For PMs & domain experts

Retrieval and generation scores in one view. See if a drop in answer quality comes from bad retrieval or bad generation.
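The split works because the two stages fail differently: if the retrieved context doesn't support the reference answer, the generator never had a chance; if the context was adequate but the answer isn't grounded in it, the problem is generation. A minimal sketch of that triage rule, assuming you have per-case ContextRecall and Faithfulness scores in [0, 1]; the `diagnose` helper and the 0.7 threshold are hypothetical choices, not Braintrust defaults.

```python
def diagnose(context_recall: float, faithfulness: float, threshold: float = 0.7) -> str:
    """Attribute a low-quality answer to the retrieval or generation stage.

    context_recall: how much of the reference answer the retrieved context supports.
    faithfulness: how well the generated answer is supported by that context.
    """
    if context_recall < threshold:
        # The needed information never reached the model: fix retrieval first.
        return "retrieval"
    if faithfulness < threshold:
        # Context was adequate, but the answer strays from it.
        return "generation"
    return "ok"
```

Checking retrieval first is deliberate: a faithful answer built on incomplete context still looks like a generation problem unless you rule out retrieval.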
Built for evals from the start
Find patterns without reading every span
Use Loop to synthesize a starting dataset and surface patterns in your traces, such as hallucinations or vector-retrieval failures, without reading each span yourself.
20+ scorers, ready to use
Use Faithfulness, ContextPrecision, ContextRecall, AnswerRelevancy, and more from autoevals, or write your own. There’s no scorer infrastructure to build or maintain.
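A custom scorer can be a plain function. The sketch below assumes the convention that a scorer receives `input`, `output`, and `expected` as keyword arguments and returns a number between 0 and 1; it further assumes `output` is the list of retrieved document IDs and `expected` the IDs a correct answer needs. The `doc_recall` name and these data shapes are hypothetical.

```python
from typing import Any, Iterable, Optional


def doc_recall(
    input: Any,
    output: Iterable[str],
    expected: Optional[Iterable[str]],
    **kwargs: Any,
) -> float:
    """Fraction of expected document IDs that appear among the retrieved IDs.

    input is unused here; **kwargs absorbs any extra arguments (e.g. metadata).
    """
    relevant = set(expected or [])
    if not relevant:
        # Nothing was required, so nothing could be missed.
        return 1.0
    retrieved = set(output)
    return len(relevant & retrieved) / len(relevant)
```

A function like this would sit in the `scores` list next to the autoevals scorers, adding a cheap, deterministic retrieval check alongside the LLM-based ones.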
Production failures become RAG test cases
When a query returns a bad answer in production, tag it and it lands in your golden dataset. Your RAG testing suite grows from real failures.
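Conceptually, this loop is a filter over logs: keep the records someone tagged as bad and turn each one into a test case. The sketch below only illustrates the idea and is not Braintrust's implementation; it assumes a hypothetical log-record shape (`input` as the query string, `output`, `tags`) and a hypothetical `"bad-answer"` tag.

```python
from typing import Any, Dict, Iterable, List


def failures_to_test_cases(
    logs: Iterable[Dict[str, Any]], tag: str = "bad-answer"
) -> List[Dict[str, Any]]:
    """Turn tagged production log records into golden-dataset rows.

    Each row keeps the original query and records the bad output in metadata;
    `expected` is left as None for a reviewer to fill in with the correct answer.
    Repeated queries are collapsed so one recurring failure becomes one test case.
    """
    seen = set()
    cases = []
    for record in logs:
        if tag not in record.get("tags", []):
            continue  # untagged records are not known failures
        query = record["input"]
        if query in seen:
            continue  # keep only the first occurrence of each failing query
        seen.add(query)
        cases.append(
            {
                "input": query,
                "expected": None,
                "metadata": {"bad_output": record.get("output")},
            }
        )
    return cases
```

Leaving `expected` empty is a deliberate choice: the failure tells you which input matters, but a human still has to say what the right answer is.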
What our customers say
“There are some problems we wouldn't know were problems without Braintrust.”
Sarah Sachs, AI Lead at Notion

Stop shipping on vibes
Set up your first eval in minutes
Free to start · No credit card required
Get started free