LLM evaluation metrics: Full guide to LLM evals and key metrics
Evaluation metrics turn subjective AI quality into measurable numbers. Without metrics, you rely on manual review and intuition. You can't systematically improve what you can't measure.
This guide covers evaluation metrics for LLMs: what they measure, when to use them, and how to implement them systematically. We'll explore metrics for general LLM outputs, RAG applications, and specialized use cases, with practical implementation examples.
Why evaluation metrics matter
The measurement problem
AI outputs are non-deterministic and subjective. The same prompt can produce different responses. Quality depends on context, user intent, and domain-specific requirements. Traditional software testing (checking for exact matches or return codes) doesn't work.
You need metrics that capture quality dimensions relevant to your use case: Is the answer factually correct? Is it relevant to the question? Does it follow instructions? Is it safe and appropriate?
Without metrics, quality evaluation becomes manual and slow. Someone reads each output, judges it subjectively, and records their assessment. This doesn't scale. You can't test thousands of examples. You can't track quality over time. You can't identify which prompt changes improve performance.
What good metrics enable
Systematic improvement: Metrics make quality measurable. You can test prompt changes and know if they improved factuality, relevance, or coherence. Iterate based on data, not guesswork.
Regression detection: Track metrics across versions. When factuality drops from 85% to 72%, you know something broke. Without metrics, regressions surface as user complaints.
A/B testing: Compare prompt variants quantitatively. Variant A scores 0.83 on relevance, variant B scores 0.91. Deploy B. Without metrics, you can't make data-driven decisions.
Continuous monitoring: Run metrics on production traffic. Detect quality degradation in real time. Alert when scores drop below thresholds. Respond before users notice problems.
Categories of evaluation metrics
Metrics fall into different categories based on what they measure and how they're implemented.
Task-agnostic vs. task-specific metrics
Task-agnostic metrics apply broadly across use cases:
- Factuality: Is the information correct?
- Coherence: Does the text flow logically?
- Safety: Is the content appropriate?
- Fluency: Is the language well-formed?
These work for most LLM applications without customization.
Task-specific metrics measure criteria unique to your application:
- For customer support: Did the response solve the user's problem?
- For code generation: Does the code compile and pass tests?
- For summarization: Does the summary capture key points from the source?
- For SQL generation: Does the query return the expected results?
Task-specific metrics require domain knowledge to implement correctly.
Code-based vs. LLM-based metrics
Code-based metrics use deterministic logic:
- Exact match: Does output exactly equal expected?
- Levenshtein distance: How many edits to match expected?
- JSON validity: Is the output valid JSON?
- Length constraints: Is the output within bounds?
- Pattern matching: Does output match required format?
Code-based metrics are fast, cheap, and deterministic. Use them whenever possible.
LLM-based metrics use language models to judge outputs:
- Factuality: An LLM verifies claims against provided context
- Relevance: An LLM judges if output addresses the input
- Tone: An LLM evaluates appropriateness
- Creativity: An LLM assesses originality
LLM-based metrics (LLM-as-a-judge) handle nuanced, subjective criteria that code can't capture. They cost more and add variability, but enable evaluation of complex quality dimensions.
Reference-based vs. reference-free metrics
Reference-based metrics compare output to an expected answer:
- Exact match
- Levenshtein distance
- BLEU score
- Semantic similarity
These require ground truth data. Use when you have known correct answers.
Reference-free metrics evaluate outputs independently:
- Coherence
- Fluency
- Safety
- JSON validity
Use when there's no single correct answer or when expected outputs aren't available.
Core LLM evaluation metrics
Factuality
Measures whether the output contains accurate, verifiable information. Critical for applications providing factual answers, summarization, or question answering.
How Braintrust measures factuality: Use the Factuality scorer from Braintrust's autoevals library. This LLM-as-a-judge scorer compares the output against the expected (reference) answer and determines whether its factual content matches, is a subset or superset of, or contradicts the reference.
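Here's a minimal sketch of calling the scorer directly in Python, following the autoevals calling convention (the question and answers are illustrative, and the judge call requires an LLM API key):

```python
from autoevals import Factuality

# Illustrative example: score one output against a reference answer
question = "Which country has the highest population?"
output = "People's Republic of China"
expected = "China"

result = Factuality()(output, expected, input=question)

print(f"Factuality score: {result.score}")               # 0-1 score
print(f"Rationale: {result.metadata.get('rationale')}")  # judge's reasoning, when available
```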
Use cases:
- RAG systems where accuracy is critical
- Customer support providing factual information
- Document summarization
- Educational applications
Limitations: Requires reliable reference answers (or source context to check against). Judge model accuracy depends on its own knowledge and instruction-following capability.
Relevance
Evaluates whether the output appropriately addresses the input. A factually correct answer that doesn't address the question scores low on relevance.
How Braintrust measures relevance: Braintrust provides multiple relevance scorers through the autoevals library for different use cases:
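For example, AnswerRelevancy judges whether the generated answer addresses the query, while ContextRelevancy judges whether retrieved passages relate to it. A sketch, assuming the ragas-style scorers accept the retrieved text through a context argument (the sample strings are illustrative):

```python
from autoevals import AnswerRelevancy, ContextRelevancy

question = "What is your refund window?"
retrieved = "Our refund policy allows returns within 30 days of purchase."
answer = "You can request a refund within 30 days of purchase."

# Does the generated answer address the question?
answer_rel = AnswerRelevancy()(answer, input=question, context=retrieved)

# Is the retrieved passage topically related to the question?
context_rel = ContextRelevancy()(answer, input=question, context=retrieved)

print(answer_rel.score, context_rel.score)
```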
Use cases:
- Search and retrieval systems
- Question answering
- Conversational AI
- Any application where staying on-topic matters
Coherence and fluency
Coherence measures logical flow and consistency within the text. Ideas should connect naturally. Arguments should follow logically. Pronouns should reference correctly.
Fluency measures grammatical correctness and naturalness. The text should read smoothly without awkward phrasing or errors.
How Braintrust measures coherence and fluency: Create custom LLM-as-a-judge scorers in Braintrust for these subjective qualities:
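One way to do this is with LLMClassifierFromTemplate (covered in more detail later). The prompt wording and three-point scale below are assumptions, not a built-in scorer:

```python
from autoevals import LLMClassifierFromTemplate

# Hypothetical coherence judge with a simple three-point scale
coherence = LLMClassifierFromTemplate(
    name="Coherence",
    prompt_template="""Evaluate the coherence of the following text.

Text: {{output}}

Answer (a) if ideas flow logically and references are clear,
(b) if there are minor logical gaps, or (c) if the text is disjointed.""",
    choice_scores={"a": 1.0, "b": 0.5, "c": 0.0},
    use_cot=True,  # ask the judge to reason before answering
)

result = coherence("First we gather requirements. Then we design the schema. Finally we ship.")
print(result.score)
```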
Use cases:
- Content generation
- Long-form text
- Creative writing
- Marketing copy
Safety and moderation
Evaluates whether output is safe, appropriate, and free from harmful content. Checks for toxicity, bias, offensive language, and policy violations.
How Braintrust measures safety: Create custom code-based scorers to pattern match against blocked words/phrases, or create LLM-as-a-judge scorers to evaluate against a specific set of safety criteria.
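A code-based blocklist scorer might look like the sketch below; the patterns and the input/output/expected signature are illustrative, and the function returns a 0-1 score like any built-in scorer:

```python
import re

# Hypothetical blocklist; replace with your own policy terms
BLOCKED_PATTERNS = [r"\bssn\b", r"\bpassword\b", r"\bcredit card number\b"]

def no_blocked_phrases(input, output, expected=None):
    """Code-based safety scorer: 0 if any blocked pattern appears in the output, else 1."""
    text = output.lower()
    return 0.0 if any(re.search(p, text) for p in BLOCKED_PATTERNS) else 1.0
```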
Use cases:
- Public-facing chatbots
- Content generation for platforms
- Educational applications
- Any application where inappropriate content creates risk
Categories Braintrust checks:
- Toxicity
- Hate speech
- Sexual content
- Violence
- Self-harm
- Security vulnerabilities
Semantic similarity
Measures how close the output is to the expected answer in meaning, regardless of exact wording. "The capital of France is Paris" and "Paris is France's capital" score high similarity despite different phrasing.
How Braintrust measures semantic similarity: Use Braintrust's EmbeddingSimilarity scorer from autoevals:
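A minimal example (EmbeddingSimilarity calls an embedding model under the hood, so it needs an API key for the default provider; the sentences are illustrative):

```python
from autoevals import EmbeddingSimilarity

result = EmbeddingSimilarity()(
    "Paris is France's capital",       # output
    "The capital of France is Paris",  # expected
)
print(result.score)  # close to 1.0 for paraphrases of the same fact
```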
Use cases:
- Question answering where multiple phrasings are acceptable
- Comparing system responses across versions
- Regression testing where exact matches aren't required
Limitations: Doesn't distinguish between different correct answers with similar embeddings. Can give false positives for semantically similar but factually wrong answers.
Exact match and string distance
Exact match: Binary metric. Does output exactly match expected?
Levenshtein distance: Counts minimum edits (insertions, deletions, substitutions) to transform output into expected text. Normalized to a 0-1 score.
How Braintrust measures exact match and string distance: Use Braintrust's code-based scorers from autoevals for fast, deterministic evaluation:
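For example (the ID strings are illustrative):

```python
from autoevals import ExactMatch, Levenshtein

exact = ExactMatch()("ORD-1234", "ORD-1234")   # 1.0 only when output equals expected
edits = Levenshtein()("ORD-1234", "ORD-1243")  # partial credit based on edit distance

print(exact.score, edits.score)
```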
Use cases:
- Structured outputs (IDs, codes, formatted data)
- Cases with single canonical answers
- Regression tests
- Baseline metrics before using more sophisticated evaluation
Limitations: Brittle. "The answer is 42" and "42" score poorly despite same semantic content.
RAG-specific evaluation metrics
RAG (Retrieval-Augmented Generation) systems have unique evaluation requirements. Braintrust provides specialized scorers through autoevals to measure both retrieval quality and generation quality.
Context precision
Measures whether retrieved context contains relevant information. How much of the retrieved content actually helps answer the query?
How Braintrust measures context precision: Use the ContextPrecision scorer from Braintrust's autoevals library:
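A sketch, assuming the scorer accepts the retrieved passages as a context list alongside input, output, and expected (all sample text is illustrative):

```python
from autoevals import ContextPrecision

result = ContextPrecision()(
    output="Returns are accepted within 30 days.",
    expected="Customers can return items within 30 days of purchase.",
    input="What is the return policy?",
    context=[
        "Our return policy allows returns within 30 days of purchase.",
        "The company was founded in 2012.",  # irrelevant passage should pull precision down
    ],
)
print(result.score)
```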
Use cases:
- Tuning retrieval parameters (top-k, similarity thresholds)
- Comparing retrieval strategies
- Identifying when retrieval is pulling irrelevant content
Context recall
Measures whether the retrieved context contains all information needed to answer the query. Did retrieval miss important documents?
How Braintrust measures context recall: Use the ContextRecall scorer from Braintrust's autoevals library:
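A sketch under the same assumptions as the precision example above:

```python
from autoevals import ContextRecall

result = ContextRecall()(
    output="Returns are accepted within 30 days.",
    expected="Customers can return items within 30 days and receive a full refund.",
    input="What is the return policy?",
    context=["Our return policy allows returns within 30 days of purchase."],
)
print(result.score)  # lower when the context is missing facts the expected answer relies on
```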
Use cases:
- Optimizing retrieval coverage
- Diagnosing why the system produces incomplete answers
- Balancing precision and recall in retrieval
Context relevance
Evaluates whether retrieved documents are topically related to the query. Similar to precision but focused on topical relevance rather than direct usefulness.
How Braintrust measures context relevance: Use the ContextRelevancy scorer shown earlier in the Relevance section. Braintrust evaluates whether each retrieved document relates to the query.
Faithfulness
Measures whether the generated answer is grounded in the retrieved context. Does the LLM hallucinate information not present in the context?
How Braintrust measures faithfulness: Use the Faithfulness scorer from Braintrust's autoevals library:
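A sketch (the unsupported shipping claim illustrates what faithfulness penalizes; the context argument follows the same assumption as the other RAG examples):

```python
from autoevals import Faithfulness

result = Faithfulness()(
    output="Returns are accepted within 30 days and shipping is always free.",  # second claim is unsupported
    input="What is the return policy?",
    context=["Our return policy allows returns within 30 days of purchase."],
)
print(result.score)  # penalized for claims not grounded in the retrieved context
```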
Use cases:
- Detecting hallucinations
- Ensuring answers stay grounded in source material
- Applications where accuracy is critical (legal, medical, financial)
Answer correctness
Combines factual accuracy with completeness. Is the answer both correct and complete?
How Braintrust measures answer correctness: Use the AnswerCorrectness scorer from Braintrust's autoevals library:
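A sketch with illustrative strings:

```python
from autoevals import AnswerCorrectness

result = AnswerCorrectness()(
    output="Returns are accepted within 30 days.",
    expected="Customers can return items within 30 days of purchase for a full refund.",
    input="What is the return policy?",
)
print(result.score)  # blends factual accuracy with completeness
```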
Use cases:
- End-to-end RAG evaluation
- Comparing retrieval + generation pipelines
- Regression testing
Specialized metrics
JSON validity
Checks if output is valid JSON that can be parsed.
How Braintrust measures JSON validity: Use the JSONDiff scorer from Braintrust's autoevals library or create a custom code-based scorer:
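A custom code-based validity check is a few lines of Python; JSONDiff additionally gives partial credit when comparing against an expected structure. Both are shown together below as a sketch (the payloads are illustrative):

```python
import json

from autoevals import JSONDiff

def valid_json(input, output, expected=None):
    """Code-based scorer: 1 if the output parses as JSON, else 0."""
    try:
        json.loads(output)
        return 1.0
    except (json.JSONDecodeError, TypeError):
        return 0.0

# Reference-based comparison of two JSON payloads
result = JSONDiff()('{"name": "Ada", "role": "admin"}', '{"name": "Ada", "role": "user"}')
print(result.score)  # partial credit for matching keys and values
```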
Use cases:
- Structured output generation
- API response generation
- Function calling
- Any application requiring parseable data structures
SQL correctness
Evaluates whether generated SQL queries are syntactically valid and semantically correct.
How Braintrust measures SQL correctness: Create custom code-based scorers in Braintrust for multi-level SQL validation:
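One multi-level sketch, assuming a SQLite fixture database (the file name and scoring tiers are illustrative): score 0 if the generated query fails to run, partial credit if it runs but returns the wrong rows, and full credit if its results match the reference query's results.

```python
import sqlite3

def sql_results_match(input, output, expected=None):
    """Multi-level SQL scorer sketch: syntax check first, then result comparison."""
    conn = sqlite3.connect("test.db")  # hypothetical fixture database
    try:
        try:
            got = conn.execute(output).fetchall()
        except sqlite3.Error:
            return 0.0  # query is invalid or fails against the schema
        want = conn.execute(expected).fetchall()
        return 1.0 if sorted(map(repr, got)) == sorted(map(repr, want)) else 0.5
    finally:
        conn.close()
```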
Use cases:
- Text-to-SQL applications
- Natural language database queries
- Business intelligence tools
Numeric difference
Measures how close a numeric output is to the expected value, normalized to a 0-1 score.
How Braintrust measures numeric difference: Use the NumericDiff scorer from Braintrust's autoevals library:
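For example:

```python
from autoevals import NumericDiff

result = NumericDiff()(19.5, 20.0)  # output vs. expected
print(result.score)  # near 1.0 for close values, lower as the gap grows
```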
Use cases:
- Math problem solving
- Calculations and computations
- Quantitative reasoning
Implementing metrics with Braintrust
Braintrust provides comprehensive infrastructure for implementing, tracking, and acting on evaluation metrics.
Built-in metrics through autoevals
Braintrust uses the autoevals library to provide 25+ pre-built scorers you can use immediately. These scorers cover common LLM evaluation scenarios:
LLM-as-a-judge scorers in Braintrust:
- Factuality - verify claims against source context
- Security - detect prompt injection and PII leaks
- Moderation - check for toxic, harmful content
- Summarization - evaluate summary quality
- Translation - assess translation accuracy
- Humor - measure comedic effectiveness
- Battle - A/B compare two outputs
RAG scorers in Braintrust:
- Context precision - measure retrieval relevance
- Context relevancy - evaluate topical relatedness
- Context recall - check if all needed info was retrieved
- Context entity recall - verify entity coverage
- Faithfulness - detect hallucinations
- Answer relevancy - assess if output addresses query
- Answer similarity - compare semantic closeness
- Answer correctness - measure accuracy and completeness
Heuristic scorers in Braintrust:
- Exact match - binary comparison
- Levenshtein distance - edit distance scoring
- Numeric difference - mathematical accuracy
- JSON diff - structured output validation
- Embedding similarity - semantic comparison
Use Braintrust's built-in metrics out of the box:
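A minimal eval, closely following the Braintrust quickstart pattern (the project name and toy task are placeholders):

```python
from braintrust import Eval
from autoevals import Factuality, Levenshtein

Eval(
    "Say Hi Bot",  # hypothetical project name
    data=lambda: [{"input": "Foo", "expected": "Hi Foo"}],
    task=lambda input: "Hi " + input,  # stand-in for your application code
    scores=[Factuality, Levenshtein],  # built-in scorers run on every test case
)
```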
Custom code-based scorers in Braintrust
Braintrust supports custom code-based scorers for domain-specific requirements. These scorers are fast, deterministic, and cost nothing to run:
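For instance, format and business-rule checks can be plain Python functions returning a 0-1 score, passed alongside built-in scorers (the regex and character limit below are illustrative):

```python
import re

def valid_email(input, output, expected=None):
    """Format check: 1 if the output looks like an email address, else 0."""
    return 1.0 if re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", output.strip()) else 0.0

def within_limit(input, output, expected=None):
    """Business rule: 1 if the response stays under a 280-character limit."""
    return 1.0 if len(output) <= 280 else 0.0
```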
When to use custom code-based scorers in Braintrust:
- Format validation (email, phone numbers, URLs)
- Business rule compliance (price ranges, character limits)
- Regex pattern matching (code syntax, structured formats)
- Length constraints (min/max characters)
- Required field presence (must include certain keywords)
- Data type validation (numbers, dates, boolean values)
Custom LLM-as-a-judge scorers in Braintrust
Braintrust enables custom LLM-based scorers using LLMClassifierFromTemplate from autoevals. Create scorers for subjective, domain-specific quality criteria:
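A sketch of a brand-tone judge; the prompt wording, choices, and scores are assumptions you would tune for your own guidelines:

```python
from autoevals import LLMClassifierFromTemplate

tone = LLMClassifierFromTemplate(
    name="Tone",
    prompt_template="""You are assessing whether a support reply matches a
professional, empathetic brand voice.

Reply: {{output}}

Answer (a) if the tone is professional and empathetic,
(b) if it is acceptable but flat, or (c) if it is inappropriate.""",
    choice_scores={"a": 1.0, "b": 0.5, "c": 0.0},
    use_cot=True,  # have the judge reason before choosing
)

result = tone("Sorry for the trouble! I've refunded your order and emailed a confirmation.")
print(result.score)
```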
When to use custom LLM-as-a-judge scorers in Braintrust:
- Industry-specific accuracy (medical, legal, financial correctness)
- Tone and style evaluation (professional, casual, empathetic)
- Brand voice compliance (matches company guidelines)
- Completeness checks (addresses all parts of multi-part questions)
- Reasoning quality (logical, well-structured arguments)
- Domain expertise demonstration (shows technical knowledge)
Tracking metrics across experiments
Run evals with multiple scorers (see the sketch after this list). Braintrust tracks all metrics for each test case and aggregates results:
- View score distributions across your dataset
- Filter test cases by score to find failures
- Compare metrics across prompt versions or model changes
- Track metric trends over time
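A sketch of an eval wired up with several scorers at once (the project name, data, and stand-in task are illustrative):

```python
from braintrust import Eval
from autoevals import EmbeddingSimilarity, Factuality, Levenshtein

def answer(question: str) -> str:
    # Stand-in for your LLM pipeline; replace with your real application code
    return "We accept refund requests within 30 days of purchase."

Eval(
    "Support QA",  # hypothetical project name
    data=lambda: [
        {"input": "What is your refund window?",
         "expected": "Refunds are accepted within 30 days."},
    ],
    task=answer,
    scores=[Factuality, EmbeddingSimilarity, Levenshtein],  # every case gets all three scores
)
```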
Metrics in CI/CD
Braintrust's GitHub Action integration runs evaluations automatically on every pull request, and metric changes appear directly in PR comments.
Braintrust supports quality gates that block merges if metrics degrade below thresholds, preventing quality regressions before code ships.
Online scoring in production
Run scorers on production traffic to monitor quality continuously. Online scoring evaluates traces asynchronously in the background after they're logged, with zero impact on request latency.
Configuration: Set up scoring rules at the project level through Braintrust's Configuration page. Each rule specifies:
- Which scorer to apply (autoevals or custom)
- Sampling rate (1-10% typical for high-volume apps, 100% for low volume)
- BTQL filter to select which spans to score
- Span type (root spans, specific names, or all)
How it works: When your application logs traces normally, Braintrust automatically applies configured scoring rules in the background. Scores appear as child spans in your logs with evaluation details.
After deployment, configure online scoring rules in the UI. Scoring happens automatically based on your rules without code changes.
Manual scoring: You can also score historical logs retroactively through the UI by selecting logs and applying scorers. This is useful for testing new evaluation criteria before enabling them as online rules.
Best practices for evaluation metrics
Start with simple metrics
Begin with exact match or string distance. Establish baselines. Add sophistication when simple metrics prove insufficient.
Combine multiple metrics
No single metric captures all quality dimensions. Use combinations:
- Factuality + relevance for Q&A
- Context precision + faithfulness for RAG
- Safety + coherence + fluency for content generation
Calibrate LLM-based scorers
LLM-as-a-judge introduces variability. Validate scorers against human judgments. Adjust prompts if scores don't align with human evaluation.
Use chain-of-thought in scorer prompts to understand reasoning. This helps debug score disagreements.
Use task-specific metrics
Generic metrics provide broad coverage but miss use-case-specific quality dimensions. Invest in custom metrics for criteria that matter to your application:
- Customer support: problem resolution rate
- Code generation: compilation success + test passage
- Creative writing: originality + engagement
Track metrics over time
Metrics matter most when tracked across iterations. Compare scores before and after prompt changes, model updates, or retrieval modifications. Trend lines reveal gradual quality shifts that point-in-time evaluation misses.
Sample appropriately for online scoring
Scoring every production request can be expensive. Configure online scoring rules with sampling rates based on:
- Traffic volume (1-10% typical for high-volume applications)
- Request characteristics (use BTQL filters to score specific span types or metadata)
- Metric cost (higher sampling for cheap metrics, lower for expensive LLM-based metrics)
Online scoring runs asynchronously with zero latency impact, but costs still accumulate based on volume. Balance coverage with evaluation costs.
Common pitfalls
Over-optimizing single metrics
Optimizing only for factuality might produce dry, technically correct but unhelpful answers. Balance multiple quality dimensions.
Ignoring edge cases
Metrics averaged across your test set can hide systematic failures on specific input types. Analyze score distributions. Identify low-scoring subgroups.
Using inappropriate metrics
Exact match doesn't work for open-ended questions. Semantic similarity doesn't distinguish between different correct answers. Match metrics to your use case.
Insufficient test data
Running metrics on 10 examples doesn't reveal patterns. You need hundreds to thousands of diverse test cases to reliably measure quality.
Not versioning scorers
Scorer changes affect metric values. Track scorer versions alongside prompt versions to ensure apples-to-apples comparisons.
Get started with LLM metrics
Braintrust provides complete infrastructure for implementing and tracking evaluation metrics:
- Autoevals library with 25+ pre-built scorers
- Custom scorer creation through UI or SDK
- Playground for testing metrics interactively
- CI/CD integration for metric tracking on every PR
- Online scoring for production monitoring
Get started with Braintrust for free with 1 million trace spans included, no credit card required.
Frequently asked questions
What's the difference between metrics and scorers?
In practice, the terms are used interchangeably. Technically, a metric is what you're measuring (factuality, relevance) while a scorer is the implementation that measures it. In Braintrust, scorers are functions that return scores for specific metrics.
Should I use code-based or LLM-based scorers?
Use code-based scorers whenever possible because they're faster, cheaper, and deterministic. Use LLM-based scorers for subjective criteria that code can't capture: tone, creativity, nuanced accuracy. Many applications benefit from both types.
How many evaluation metrics should I track?
Start with 2-3 metrics covering your most critical quality dimensions. Add more as needed, but avoid tracking metrics you won't act on. More metrics mean more complexity in interpretation and higher evaluation costs.
Can I use the same metrics for development and production?
Yes, but sampling strategies differ. In development, run all metrics on your full test suite during evaluation. In production, configure online scoring rules with appropriate sampling rates (1-10% for high-volume applications, higher for low volume). Set up multiple rules with different sampling rates to prioritize inexpensive metrics at higher rates and expensive LLM-based metrics at lower rates.
How do I know if my custom scorer is reliable?
Validate against human judgments. Have humans score 100-200 examples. Compare human scores to your scorer's outputs. Calculate correlation. If alignment is poor, refine the scorer prompt or logic. Iterate until scores match human judgment reasonably well.
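A small sketch of that comparison, with hypothetical score arrays standing in for your exported human labels and scorer outputs:

```python
import numpy as np

# Hypothetical scores for the same examples: human labels vs. your scorer's outputs
human_scores = np.array([1.0, 0.0, 1.0, 0.5, 1.0])
scorer_scores = np.array([0.9, 0.2, 1.0, 0.4, 0.7])

correlation = np.corrcoef(human_scores, scorer_scores)[0, 1]
print(f"Pearson correlation with human judgment: {correlation:.2f}")
```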