
LLM evaluation metrics: Full guide to LLM evals and key metrics

28 October 2025 · Braintrust Team

Evaluation metrics turn subjective AI quality into measurable numbers. Without metrics, you rely on manual review and intuition. You can't systematically improve what you can't measure.

This guide covers evaluation metrics for LLMs: what they measure, when to use them, and how to implement them systematically. We'll explore metrics for general LLM outputs, RAG applications, and specialized use cases, with practical implementation examples.

Why evaluation metrics matter

The measurement problem

AI outputs are non-deterministic and subjective. The same prompt can produce different responses. Quality depends on context, user intent, and domain-specific requirements. Traditional software testing (checking for exact matches or return codes) doesn't work.

You need metrics that capture quality dimensions relevant to your use case: Is the answer factually correct? Is it relevant to the question? Does it follow instructions? Is it safe and appropriate?

Without metrics, quality evaluation becomes manual and slow. Someone reads each output, judges it subjectively, and records their assessment. This doesn't scale. You can't test thousands of examples. You can't track quality over time. You can't identify which prompt changes improve performance.

What good metrics enable

Systematic improvement: Metrics make quality measurable. You can test prompt changes and know if they improved factuality, relevance, or coherence. Iterate based on data, not guesswork.

Regression detection: Track metrics across versions. When factuality drops from 85% to 72%, you know something broke. Without metrics, regressions surface as user complaints.

A/B testing: Compare prompt variants quantitatively. Variant A scores 0.83 on relevance, variant B scores 0.91. Deploy B. Without metrics, you can't make data-driven decisions.

Continuous monitoring: Run metrics on production traffic. Detect quality degradation in real-time. Alert when scores drop below thresholds. Respond before users notice problems.

Categories of evaluation metrics

Metrics fall into different categories based on what they measure and how they're implemented.

Task-agnostic vs. task-specific metrics

Task-agnostic metrics apply broadly across use cases:

  • Factuality: Is the information correct?
  • Coherence: Does the text flow logically?
  • Safety: Is the content appropriate?
  • Fluency: Is the language well-formed?

These work for most LLM applications without customization.

Task-specific metrics measure criteria unique to your application:

  • For customer support: Did the response solve the user's problem?
  • For code generation: Does the code compile and pass tests?
  • For summarization: Does the summary capture key points from the source?
  • For SQL generation: Does the query return the expected results?

Task-specific metrics require domain knowledge to implement correctly.

Code-based vs. LLM-based metrics

Code-based metrics use deterministic logic:

  • Exact match: Does output exactly equal expected?
  • Levenshtein distance: How many edits to match expected?
  • JSON validity: Is the output valid JSON?
  • Length constraints: Is the output within bounds?
  • Pattern matching: Does output match required format?

Code-based metrics are fast, cheap, and deterministic. Use them whenever possible.
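
For example, a format check can be written as a small deterministic function. This is a hypothetical sketch; the order-ID pattern is purely illustrative:

// Code-based scorer (illustrative): checks that the output matches a
// required order-ID format such as "ORD-12345".
function orderIdFormatScorer({ output }: { output: string }): number {
  return /^ORD-\d{5}$/.test(output.trim()) ? 1 : 0;
}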

LLM-based metrics use language models to judge outputs:

  • Factuality: An LLM verifies claims against provided context
  • Relevance: An LLM judges if output addresses the input
  • Tone: An LLM evaluates appropriateness
  • Creativity: An LLM assesses originality

LLM-based metrics (LLM-as-a-judge) handle nuanced, subjective criteria that code can't capture. They cost more and add variability, but enable evaluation of complex quality dimensions.

Reference-based vs. reference-free metrics

Reference-based metrics compare output to an expected answer:

  • Exact match
  • Levenshtein distance
  • BLEU score
  • Semantic similarity

These require ground truth data. Use when you have known correct answers.

Reference-free metrics evaluate outputs independently:

  • Coherence
  • Fluency
  • Safety
  • JSON validity

Use when there's no single correct answer or when expected outputs aren't available.

Core LLM evaluation metrics

Factuality

Measures whether the output contains accurate, verifiable information. Critical for applications providing factual answers, summarization, or question answering.

How Braintrust measures factuality: Use the Factuality scorer from Braintrust's autoevals library. This LLM-as-a-judge scorer compares the output against a reference (expected) answer to determine whether its claims are factually consistent.
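
A minimal usage sketch with illustrative values:

import { Factuality } from "autoevals";
 
// Braintrust judges whether the output agrees with the expected answer
const factualityScore = await Factuality({
  input: "When was the Eiffel Tower completed?",
  output: "The Eiffel Tower was completed in 1889",
  expected: "1889",
});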

Use cases:

  • RAG systems where accuracy is critical
  • Customer support providing factual information
  • Document summarization
  • Educational applications

Limitations: Requires a reliable reference answer or source context to judge against. The judge model's accuracy also depends on its own knowledge and instruction-following capability.

Relevance

Evaluates whether the output appropriately addresses the input. A factually correct answer that doesn't address the question scores low on relevance.

How Braintrust measures relevance: Braintrust provides multiple relevance scorers through the autoevals library for different use cases:

import { AnswerRelevancy, ContextRelevancy } from "autoevals";
 
// Measure answer relevance with Braintrust
const answerScore = await AnswerRelevancy({
  input: "What are the side effects of aspirin?",
  output: "Aspirin can cause stomach upset, bleeding, and allergic reactions",
});
 
// Measure context relevance for RAG systems
const contextScore = await ContextRelevancy({
  question: "What are the side effects of aspirin?",
  context: "Aspirin is a pain reliever and anti-inflammatory drug...",
});

Use cases:

  • Search and retrieval systems
  • Question answering
  • Conversational AI
  • Any application where staying on-topic matters

Coherence and fluency

Coherence measures logical flow and consistency within the text. Ideas should connect naturally. Arguments should follow logically. Pronouns should reference correctly.

Fluency measures grammatical correctness and naturalness. The text should read smoothly without awkward phrasing or errors.

How Braintrust measures coherence and fluency: Create custom LLM-as-a-judge scorers in Braintrust for these subjective qualities:

import { LLMClassifierFromTemplate } from "autoevals";
 
// Braintrust supports custom coherence scoring
const CoherenceScorer = LLMClassifierFromTemplate({
  name: "Coherence",
  promptTemplate: `Rate the coherence of this text from 1-5:
{{output}}
 
Does it flow logically? Are ideas connected? Rate 1-5.`,
  choiceScores: { "1": 0.2, "2": 0.4, "3": 0.6, "4": 0.8, "5": 1.0 },
});

Use cases:

  • Content generation
  • Long-form text
  • Creative writing
  • Marketing copy

Safety and moderation

Evaluates whether output is safe, appropriate, and free from harmful content. Checks for toxicity, bias, offensive language, and policy violations.

How Braintrust measures safety: Create custom code-based scorers to pattern match against blocked words/phrases, or create LLM-as-a-judge scorers to evaluate against a specific set of safety criteria.

// Option 1: Code-based pattern matching in Braintrust
const blockedWords = ["spam", "scam", "offensive-term"];
 
function profanityScorer({ output }: { output: string }): number {
  const lowerOutput = output.toLowerCase();
  const hasProfanity = blockedWords.some((word) => lowerOutput.includes(word));
  return hasProfanity ? 0 : 1;
}
 
// Option 2: LLM-as-a-judge for safety in Braintrust
import { LLMClassifierFromTemplate } from "autoevals";
 
const SafetyScorer = LLMClassifierFromTemplate({
  name: "ContentSafety",
  promptTemplate: `Evaluate if this content is safe and appropriate:
 
Content: {{output}}
 
Check for:
- Toxicity or offensive language
- Hate speech or discrimination
- Violence or harmful content
- Inappropriate sexual content
 
Is this content safe? Respond with 1 for safe, 0 for unsafe.`,
  choiceScores: { "1": 1, "0": 0 },
  useCoT: true,
});

Use cases:

  • Public-facing chatbots
  • Content generation for platforms
  • Educational applications
  • Any application where inappropriate content creates risk

Categories Braintrust checks:

  • Toxicity
  • Hate speech
  • Sexual content
  • Violence
  • Self-harm
  • Security vulnerabilities

Semantic similarity

Measures how close the output is to the expected answer in meaning, regardless of exact wording. "The capital of France is Paris" and "Paris is France's capital" score high similarity despite different phrasing.

How Braintrust measures semantic similarity: Use Braintrust's EmbeddingSimilarity scorer from autoevals:

import { EmbeddingSimilarity } from "autoevals";
 
// Braintrust compares semantic similarity using embeddings
const similarityScore = await EmbeddingSimilarity({
  output: "Paris is France's capital city",
  expected: "The capital of France is Paris",
});

Use cases:

  • Question answering where multiple phrasings are acceptable
  • Comparing system responses across versions
  • Regression testing where exact matches aren't required

Limitations: Doesn't distinguish between different correct answers with similar embeddings. Can give false positives for semantically similar but factually wrong answers.

Exact match and string distance

Exact match: Binary metric. Does output exactly match expected?

Levenshtein distance: Counts the minimum number of edits (insertions, deletions, substitutions) needed to transform the output into the expected text, normalized to a 0-1 score.

How Braintrust measures exact match and string distance: Use Braintrust's code-based scorers from autoevals for fast, deterministic evaluation:

import { Levenshtein } from "autoevals";
 
// Braintrust provides exact match and string distance scoring
const exactMatch = (output: string, expected: string) =>
  output === expected ? 1 : 0;
 
// Braintrust measures edit distance
const distance = await Levenshtein({
  output: "The answer is 42",
  expected: "42",
});

Use cases:

  • Structured outputs (IDs, codes, formatted data)
  • Cases with single canonical answers
  • Regression tests
  • Baseline metrics before using more sophisticated evaluation

Limitations: Brittle. "The answer is 42" and "42" score poorly despite same semantic content.

RAG-specific evaluation metrics

RAG (Retrieval-Augmented Generation) systems have unique evaluation requirements. Braintrust provides specialized scorers through autoevals to measure both retrieval quality and generation quality.

Context precision

Measures whether retrieved context contains relevant information. How much of the retrieved content actually helps answer the query?

How Braintrust measures context precision: Use the ContextPrecision scorer from Braintrust's autoevals library:

import { ContextPrecision } from "autoevals";
 
// Braintrust evaluates RAG context precision
const precisionScore = await ContextPrecision({
  question: "What are the health benefits of exercise?",
  context: [
    "Doc1: Exercise improves cardiovascular health...",
    "Doc2: The weather today...",
  ],
  expected: "Exercise improves heart health and reduces disease risk",
});

Use cases:

  • Tuning retrieval parameters (top-k, similarity thresholds)
  • Comparing retrieval strategies
  • Identifying when retrieval is pulling irrelevant content

Context recall

Measures whether the retrieved context contains all information needed to answer the query. Did retrieval miss important documents?

How Braintrust measures context recall: Use the ContextRecall scorer from Braintrust's autoevals library:

import { ContextRecall } from "autoevals";
 
// Braintrust checks if retrieval captured all necessary information
const recallScore = await ContextRecall({
  question: "What are the health benefits of exercise?",
  context: ["Exercise improves cardiovascular health and mental well-being"],
  expected:
    "Exercise improves heart health, mental well-being, and weight management",
});

Use cases:

  • Optimizing retrieval coverage
  • Diagnosing why the system produces incomplete answers
  • Balancing precision and recall in retrieval

Context relevance

Evaluates whether retrieved documents are topically related to the query. Similar to precision but focused on topical relevance rather than direct usefulness.

How Braintrust measures context relevance: Use the ContextRelevancy scorer shown earlier in the Relevance section. Braintrust evaluates whether each retrieved document relates to the query.

Faithfulness

Measures whether the generated answer is grounded in the retrieved context. Does the LLM hallucinate information not present in the context?

How Braintrust measures faithfulness: Use the Faithfulness scorer from Braintrust's autoevals library:

import { Faithfulness } from "autoevals";
 
// Braintrust detects hallucinations in RAG outputs
const faithfulnessScore = await Faithfulness({
  context:
    "Paris is the capital of France. It has a population of 2.1 million.",
  output:
    "Paris, with 2.1 million people, is France's capital and largest city",
});

Use cases:

  • Detecting hallucinations
  • Ensuring answers stay grounded in source material
  • Applications where accuracy is critical (legal, medical, financial)

Answer correctness

Combines factual accuracy with completeness. Is the answer both correct and complete?

How Braintrust measures answer correctness: Use the AnswerCorrectness scorer from Braintrust's autoevals library:

import { AnswerCorrectness } from "autoevals";
 
// Braintrust evaluates both accuracy and completeness
const correctnessScore = await AnswerCorrectness({
  input: "What is the capital of France?",
  output: "Paris",
  expected: "Paris is the capital of France",
});

Use cases:

  • End-to-end RAG evaluation
  • Comparing retrieval + generation pipelines
  • Regression testing

Specialized metrics

JSON validity

Checks if output is valid JSON that can be parsed.

How Braintrust measures JSON validity: Use the JSONDiff scorer from Braintrust's autoevals library or create a custom code-based scorer:

import { JSONDiff } from "autoevals";
 
// Braintrust validates and compares JSON outputs
const jsonScore = await JSONDiff({
  output: '{"status": "success", "count": 42}',
  expected: '{"status": "success", "count": 42}',
});
 
// Or use a simple custom validator in Braintrust
function validateJSON({ output }: { output: string }): number {
  try {
    JSON.parse(output);
    return 1;
  } catch {
    return 0;
  }
}

Use cases:

  • Structured output generation
  • API response generation
  • Function calling
  • Any application requiring parseable data structures

SQL correctness

Evaluates whether generated SQL queries are syntactically valid and semantically correct.

How Braintrust measures SQL correctness: Create custom code-based scorers in Braintrust for multi-level SQL validation:

// Braintrust supports custom SQL validation scorers
// (sqlParser and database are placeholders for your SQL parser library
// and database client)
function validateSQLSyntax({ output }: { output: string }): number {
  // Parse SQL and check syntax
  try {
    sqlParser.parse(output);
    return 1;
  } catch {
    return 0;
  }
}
 
async function validateSQLExecution({ output }: { output: string }): Promise<number> {
  // Execute SQL and check if it runs without errors
  try {
    await database.query(output);
    return 1;
  } catch {
    return 0;
  }
}

Use cases:

  • Text-to-SQL applications
  • Natural language database queries
  • Business intelligence tools

Numeric difference

Measures how close a numeric output is to the expected value, normalized to 0-1 score.

How Braintrust measures numeric difference: Use the NumericDiff scorer from Braintrust's autoevals library:

import { NumericDiff } from "autoevals";
 
// Braintrust scores numeric accuracy
const numericScore = await NumericDiff({
  output: "42",
  expected: "40",
});

Use cases:

  • Math problem solving
  • Calculations and computations
  • Quantitative reasoning

Implementing metrics with Braintrust

Braintrust provides comprehensive infrastructure for implementing, tracking, and acting on evaluation metrics.

Built-in metrics through autoevals

Braintrust uses the autoevals library to provide 25+ pre-built scorers you can use immediately. These scorers cover common LLM evaluation scenarios:

LLM-as-a-judge scorers in Braintrust:

  • Factuality - verify claims against source context
  • Security - detect prompt injection and PII leaks
  • Moderation - check for toxic, harmful content
  • Summarization - evaluate summary quality
  • Translation - assess translation accuracy
  • Humor - measure comedic effectiveness
  • Battle - A/B compare two outputs

RAG scorers in Braintrust:

  • Context precision - measure retrieval relevance
  • Context relevancy - evaluate topical relatedness
  • Context recall - check if all needed info was retrieved
  • Context entity recall - verify entity coverage
  • Faithfulness - detect hallucinations
  • Answer relevancy - assess if output addresses query
  • Answer similarity - compare semantic closeness
  • Answer correctness - measure accuracy and completeness

Heuristic scorers in Braintrust:

  • Exact match - binary comparison
  • Levenshtein distance - edit distance scoring
  • Numeric difference - mathematical accuracy
  • JSON diff - structured output validation
  • Embedding similarity - semantic comparison

Use Braintrust's built-in metrics out of the box:

import { Eval } from "braintrust";
import { Factuality, ContextRelevancy, AnswerCorrectness } from "autoevals";
 
// Braintrust runs multiple metrics on every test case
Eval("RAG System", {
  data: () => testCases,
  task: async (input) => await ragPipeline(input),
  scores: [Factuality, ContextRelevancy, AnswerCorrectness],
});

Custom code-based scorers in Braintrust

Braintrust supports custom code-based scorers for domain-specific requirements. These scorers are fast, deterministic, and cost nothing to run:

When to use custom code-based scorers in Braintrust:

  • Format validation (email, phone numbers, URLs)
  • Business rule compliance (price ranges, character limits)
  • Regex pattern matching (code syntax, structured formats)
  • Length constraints (min/max characters)
  • Required field presence (must include certain keywords)
  • Data type validation (numbers, dates, boolean values)

function validJSONScorer({ output }: { output: string }): number {
  try {
    JSON.parse(output);
    return 1;
  } catch {
    return 0;
  }
}
 
function lengthConstraintScorer({ output }: { output: string }): number {
  return output.length <= 100 ? 1 : 0;
}

Custom LLM-as-a-judge scorers in Braintrust

Braintrust enables custom LLM-based scorers using LLMClassifierFromTemplate from autoevals. Create scorers for subjective, domain-specific quality criteria:

When to use custom LLM-as-a-judge scorers in Braintrust:

  • Industry-specific accuracy (medical, legal, financial correctness)
  • Tone and style evaluation (professional, casual, empathetic)
  • Brand voice compliance (matches company guidelines)
  • Completeness checks (addresses all parts of multi-part questions)
  • Reasoning quality (logical, well-structured arguments)
  • Domain expertise demonstration (shows technical knowledge)

import { LLMClassifierFromTemplate } from "autoevals";
 
// Braintrust supports custom LLM-based evaluation
const domainAccuracyScorer = LLMClassifierFromTemplate({
  name: "MedicalAccuracy",
  promptTemplate: `Evaluate if the answer correctly addresses the medical question.
 
Question: {{input}}
Answer: {{output}}
Expected criteria: {{expected}}
 
Is the answer medically accurate and complete? Respond with 1 for yes, 0 for no.`,
  choiceScores: { "1": 1, "0": 0 },
  useCoT: true, // Braintrust enables chain-of-thought reasoning
});

Tracking metrics across experiments

Run evals with multiple scorers. Braintrust tracks all metrics for each test case and aggregates results:

  • View score distributions across your dataset
  • Filter test cases by score to find failures
  • Compare metrics across prompt versions or model changes
  • Track metric trends over time
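
For example, labeling each run lets you compare scores between prompt versions. A hedged sketch, assuming the experimentName and metadata options on Eval and reusing the testCases and ragPipeline placeholders from the earlier example:

import { Eval } from "braintrust";
import { Factuality, AnswerRelevancy } from "autoevals";
 
// Label the run so it can be compared against earlier experiments in Braintrust.
// experimentName and metadata are assumed options; testCases and ragPipeline
// are the same placeholders used in the earlier example.
Eval("RAG System", {
  experimentName: "prompt-v2",
  metadata: { promptVersion: "v2" },
  data: () => testCases,
  task: async (input) => await ragPipeline(input),
  scores: [Factuality, AnswerRelevancy],
});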

Metrics in CI/CD

Braintrust's GitHub Action integration runs evaluations automatically on every pull request. View metric changes directly in PR comments:

Factuality: 0.85 → 0.91 (+0.06)
Relevance: 0.78 → 0.82 (+0.04)
Safety: 1.00 → 1.00 (no change)

Braintrust supports quality gates that block merges if metrics degrade below thresholds, preventing quality regressions before code ships.

Online scoring in production

Run scorers on production traffic to monitor quality continuously. Online scoring evaluates traces asynchronously in the background after they're logged, with zero impact on request latency.

Configuration: Set up scoring rules at the project level through Braintrust's Configuration page. Each rule specifies:

  • Which scorer to apply (autoevals or custom)
  • Sampling rate (1-10% typical for high-volume apps, 100% for low volume)
  • BTQL filter to select which spans to score
  • Span type (root spans, specific names, or all)

How it works: When your application logs traces normally, Braintrust automatically applies configured scoring rules in the background. Scores appear as child spans in your logs with evaluation details.

import { initLogger, traced } from "braintrust";
 
const logger = initLogger({ projectName: "Production" });
 
async function handleUserQuery(query: string) {
  return traced(async (span) => {
    const response = await llm.generate(query);
 
    // Just log your traces normally
    span.log({ input: query, output: response });
 
    return response;
  });
}

After deployment, configure online scoring rules in the UI. Scoring happens automatically based on your rules without code changes.

Manual scoring: You can also score historical logs retroactively through the UI by selecting logs and applying scorers, useful for testing new evaluation criteria before enabling them as online rules.

Best practices for evaluation metrics

Start with simple metrics

Begin with exact match or string distance. Establish baselines. Add sophistication when simple metrics prove insufficient.

Combine multiple metrics

No single metric captures all quality dimensions. Use combinations:

  • Factuality + relevance for Q&A
  • Context precision + faithfulness for RAG
  • Safety + coherence + fluency for content generation

Calibrate LLM-based scorers

LLM-as-a-judge introduces variability. Validate scorers against human judgments. Adjust prompts if scores don't align with human evaluation.

Use chain-of-thought in scorer prompts to understand reasoning. This helps debug score disagreements.

Use task-specific metrics

Generic metrics provide broad coverage but miss use-case-specific quality dimensions. Invest in custom metrics for criteria that matter to your application:

  • Customer support: problem resolution rate
  • Code generation: compilation success + test passage
  • Creative writing: originality + engagement

Track metrics over time

Metrics matter most when tracked across iterations. Compare scores before and after prompt changes, model updates, or retrieval modifications. Trend lines reveal gradual quality shifts that point-in-time evaluation misses.

Sample appropriately for online scoring

Scoring every production request can be expensive. Configure online scoring rules with sampling rates based on:

  • Traffic volume (1-10% typical for high-volume applications)
  • Request characteristics (use BTQL filters to score specific span types or metadata)
  • Metric cost (higher sampling for cheap metrics, lower for expensive LLM-based metrics)

Online scoring runs asynchronously with zero latency impact, but costs still accumulate based on volume. Balance coverage with evaluation costs.

Common pitfalls

Over-optimizing single metrics

Optimizing only for factuality might produce dry, technically correct but unhelpful answers. Balance multiple quality dimensions.

Ignoring edge cases

Metrics averaged across your test set can hide systematic failures on specific input types. Analyze score distributions. Identify low-scoring subgroups.

Using inappropriate metrics

Exact match doesn't work for open-ended questions. Semantic similarity doesn't distinguish between different correct answers. Match metrics to your use case.

Insufficient test data

Running metrics on 10 examples doesn't reveal patterns. You need hundreds to thousands of diverse test cases to reliably measure quality.

Not versioning scorers

Scorer changes affect metric values. Track scorer versions alongside prompt versions to ensure apples-to-apples comparisons.

Get started with LLM metrics

Braintrust provides complete infrastructure for implementing and tracking evaluation metrics: built-in autoevals scorers, custom code-based and LLM-as-a-judge scorers, experiment tracking, CI/CD integration, and online scoring in production.

Get started with Braintrust for free with 1 million trace spans included, no credit card required.

Frequently asked questions

What's the difference between metrics and scorers?

In practice, the terms are used interchangeably. Technically, a metric is what you're measuring (factuality, relevance) while a scorer is the implementation that measures it. In Braintrust, scorers are functions that return scores for specific metrics.

Should I use code-based or LLM-based scorers?

Use code-based scorers whenever possible because they're faster, cheaper, and deterministic. Use LLM-based scorers for subjective criteria that code can't capture: tone, creativity, nuanced accuracy. Many applications benefit from both types.

How many evaluation metrics should I track?

Start with 2-3 metrics covering your most critical quality dimensions. Add more as needed, but avoid tracking metrics you won't act on. More metrics mean more complexity in interpretation and higher evaluation costs.

Can I use the same metrics for development and production?

Yes, but sampling strategies differ. In development, run all metrics on your full test suite during evaluation. In production, configure online scoring rules with appropriate sampling rates (1-10% for high-volume applications, higher for low volume). Set up multiple rules with different sampling rates to prioritize inexpensive metrics at higher rates and expensive LLM-based metrics at lower rates.

How do I know if my custom scorer is reliable?

Validate against human judgments. Have humans score 100-200 examples. Compare human scores to your scorer's outputs. Calculate correlation. If alignment is poor, refine the scorer prompt or logic. Iterate until scores match human judgment reasonably well.
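
A rough sketch of that correlation check, assuming human labels and scorer outputs are collected as parallel arrays of 0-1 scores:

// Compare human scores with automated scorer outputs using Pearson correlation.
// humanScores and scorerScores are assumed to be parallel arrays of 0-1 values.
function pearsonCorrelation(humanScores: number[], scorerScores: number[]): number {
  const n = humanScores.length;
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const mx = mean(humanScores);
  const my = mean(scorerScores);
  let cov = 0, vx = 0, vy = 0;
  for (let i = 0; i < n; i++) {
    const dx = humanScores[i] - mx;
    const dy = scorerScores[i] - my;
    cov += dx * dy;
    vx += dx * dx;
    vy += dy * dy;
  }
  return cov / Math.sqrt(vx * vy);
}
 
// Example: a value near 1 suggests the scorer tracks human judgment closely.
const agreement = pearsonCorrelation([1, 0.8, 0.2, 1], [0.9, 0.7, 0.3, 0.95]);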