How to evaluate voice agents

5 November 2025 · Braintrust Team

Voice AI agents are becoming the interface for customer support, sales, appointments, and automated assistance. But unlike traditional chatbots, voice agents introduce unique evaluation challenges: speech quality, conversational flow, latency, accent handling, and real-time decision-making all matter. Shipping a voice agent without systematic evaluation means discovering issues only when users start complaining.

The question is: How do you know if your voice agent actually works?

Why voice agents are different

Text-based AI agents operate in a controlled environment. You send structured input, get structured output, and can easily replay interactions. Voice agents face constant chaos: background noise interferes with speech recognition, accents and speech patterns vary wildly, latency destroys conversational flow, and users interrupt, change topics mid-sentence, and expect natural dialogue instead of robotic responses.

Voice agent evaluation requires measuring speech recognition quality, conversational coherence, response latency, natural language understanding across accents and noise levels, and graceful handling of interruptions and context switches. These challenges go beyond the accuracy and factuality metrics used for traditional AI evaluation.

Each layer introduces failure modes that text-based systems never encounter. Perfect language models still fail when speech recognition breaks down. Sound logic still fails when response latency causes users to hang up. Evaluating voice agents means evaluating the entire pipeline, including every component from audio input to audio output.

The voice agent stack

Understanding what to evaluate starts with understanding the components:

Speech-to-text (STT) converts audio input into text. Quality depends on accent handling, background noise resilience, and domain-specific vocabulary recognition.

Natural language understanding (NLU) interprets user intent from transcribed text. This includes entity extraction, context tracking across turns, and handling ambiguous or incomplete inputs.

Decision logic and tool use determines what action to take. Voice agents often need to query databases, call APIs, or execute multi-step workflows while maintaining conversation context.

Response generation creates appropriate replies based on conversation state. This includes determining tone, choosing information to include, and deciding when to ask clarifying questions.

Text-to-speech (TTS) converts responses into natural-sounding audio. Quality factors include voice naturalness, pronunciation accuracy, and emotional appropriateness.

Conversation management orchestrates the entire flow. This includes turn-taking, interruption handling, context retention, and knowing when to escalate to human support.

Each component can fail independently. Systematic evaluation requires testing each layer and measuring end-to-end performance.

What to measure in voice agent evaluations

Speech recognition accuracy

STT errors cascade through your entire system. If "I need to cancel my order" becomes "I need to channel my border," your agent will fail regardless of how good your logic is.

Measure word error rate (WER) across different accents and dialects, performance in noisy environments, accuracy on domain-specific terminology, and handling of speech disfluencies (ums, ahs, restarts).

Create evaluation datasets that reflect real user audio. Record samples from your target demographic with various background noise levels. Include edge cases: strong accents, technical terms, rapid speech, and natural conversation patterns with hesitations and corrections.
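
If you want a quick way to turn those recordings into numbers, WER is straightforward to compute in bulk. Here's a minimal sketch using the open-source jiwer package, with placeholder transcripts standing in for your dataset:

python
# Minimal sketch: computing word error rate with the open-source jiwer package.
# The transcripts are placeholders; in practice, pull reference transcripts and
# STT output from your evaluation dataset.
import jiwer

references = [
    "I need to cancel my order",
    "Can I book an appointment for Tuesday",
]
hypotheses = [
    "I need to channel my border",
    "Can I book an appointment for Tuesday",
]

# jiwer.wer accepts lists of reference/hypothesis strings and returns the overall WER
overall_wer = jiwer.wer(references, hypotheses)
print(f"WER: {overall_wer:.2%}")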

Intent classification and entity extraction

Once speech is transcribed, your agent needs to understand what users want. Misclassified intent means wrong actions, frustrated users, and wasted conversation turns.

Test intent classification accuracy across phrasings, entity extraction precision (dates, numbers, names), context retention across conversation turns, and handling of ambiguous or multi-intent utterances.

Voice conversations come with natural messiness. Users say things like "Yeah so I was thinking maybe I should probably cancel that thing from yesterday or was it Tuesday?" Your agent needs to extract the actual intent (cancel order) and relevant entities (date ambiguity requiring clarification).
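
One lightweight way to capture cases like that is to pair each raw utterance with the intent and entities you expect the agent to recover, plus an explicit flag for when asking a clarifying question is the right behavior. The schema below is illustrative, not a fixed format:

python
# Illustrative test case structure for intent and entity evaluation.
# The field names are an example schema, not a fixed format.
from dataclasses import dataclass, field

@dataclass
class IntentTestCase:
    utterance: str                        # raw (transcribed) user speech
    expected_intent: str                  # intent the agent should recover
    expected_entities: dict = field(default_factory=dict)
    requires_clarification: bool = False  # agent should ask a follow-up, not guess

cases = [
    IntentTestCase(
        utterance=(
            "Yeah so I was thinking maybe I should probably cancel "
            "that thing from yesterday or was it Tuesday?"
        ),
        expected_intent="cancel_order",
        expected_entities={"order_date": None},  # ambiguous: yesterday vs. Tuesday
        requires_clarification=True,
    ),
]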

Response quality and appropriateness

Generated responses need to be accurate, conversational, and contextually appropriate. Robotic responses destroy user trust. Overly verbose responses waste time. Responses that ignore context break conversation flow.

Evaluate factual accuracy against your knowledge base, conversational tone and naturalness, appropriate response length (voice has different constraints than text), and correct use of context from previous turns.

A good voice agent response sounds human. It acknowledges what the user said, provides relevant information concisely, and advances the conversation toward resolution. Test your responses by listening to them instead of only reading transcripts.

Latency and real-time performance

Latency kills voice conversations. Users expect responses within 1-2 seconds. Longer delays feel broken, cause users to repeat themselves, and destroy conversational flow.

Measure time from speech end to response start, consistency of response timing across different queries, and performance degradation under load.

Latency testing requires production-like conditions. Run load tests that simulate realistic traffic patterns. Monitor tail latencies, especially p99 latency, because users remember bad experiences more than average performance.
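
Summarizing those measurements is simple once you've extracted per-turn latencies from your traces. A sketch with placeholder values:

python
# Sketch: summarizing per-turn response latencies (seconds from end of user
# speech to start of agent audio), e.g. extracted from trace timing data.
import numpy as np

latencies = [0.9, 1.1, 1.4, 0.8, 2.3, 1.0, 4.7, 1.2]  # placeholder values

p50, p95, p99 = np.percentile(latencies, [50, 95, 99])
print(f"p50={p50:.2f}s  p95={p95:.2f}s  p99={p99:.2f}s")

# Flag the run if tail latency blows the conversational budget
budget_seconds = 2.0
if p99 > budget_seconds:
    print(f"WARNING: p99 latency {p99:.2f}s exceeds {budget_seconds:.1f}s budget")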

Task completion and goal success

Ultimately, voice agents exist to accomplish tasks. Users call to book appointments, get support, make purchases, or retrieve information. Success means completing these goals efficiently.

Track task completion rate (did the user achieve their goal?), average conversation turns to completion, need for human escalation, and user satisfaction (when measurable).

Create test scenarios that mirror real use cases. If your agent handles appointment booking, test the full flow: availability check, slot selection, confirmation, and follow-up. Measure success against realistic criteria like "did the user get an appointment" rather than just "did the agent respond."
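
A task-completion scorer can be as simple as a function that inspects the final conversation state. This sketch assumes a hypothetical structured booking record in the agent's output:

python
# Sketch of a task-completion scorer for an appointment-booking flow.
# Assumes the agent's final output includes a structured booking record;
# the field names here are hypothetical.
def appointment_completed(output: dict) -> float:
    """Return 1.0 if the conversation produced a confirmed appointment, else 0.0."""
    booking = output.get("booking") or {}
    confirmed = booking.get("confirmed") is True
    has_slot = bool(booking.get("slot_start"))
    return 1.0 if (confirmed and has_slot) else 0.0

# Example: a confirmed booking with a slot scores 1.0
print(appointment_completed({"booking": {"confirmed": True, "slot_start": "2025-11-12T10:00"}}))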

Conversation flow and coherence

Voice conversations have natural rhythm and structure. Agents that interrupt, lose context, or fail to track conversation state frustrate users.

Evaluate appropriate turn-taking (not interrupting users), context retention across multiple turns, graceful handling of topic changes, and knowing when to ask clarifying questions versus making assumptions.

Test your agent with realistic dialogue patterns. Users change their minds, add information mid-conversation, and expect agents to remember what was said three turns ago. Your evaluation data should reflect this complexity.

Building a voice agent evaluation framework

Generate realistic test data

The quality of your evaluations depends on the quality of your test data. Synthetic data is useful for coverage, but real user interactions reveal actual failure modes.

Start with production logs. Record and transcribe real conversations (with appropriate consent and privacy protections). Tag interesting cases: successful resolutions, failures, edge cases, and unusual patterns.

Supplement with synthetic data to ensure coverage. Use LLMs to generate diverse phrasings for common intents. Create TTS audio from generated text to test the full pipeline. Include edge cases that might not appear frequently in production: rare accents, technical terminology, extreme noise levels, and adversarial inputs.

The Braintrust voice agent cookbook demonstrates generating synthetic evaluation data using TTS models to create audio samples across multiple languages and voices, providing coverage that production logs alone might not capture.
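
Generating that synthetic audio is usually a short script against a TTS API. Here's a sketch using OpenAI's TTS endpoint as an example; the model and voice names (tts-1, alloy) are illustrative and depend on your provider:

python
# Sketch: generating synthetic audio samples with a TTS API for pipeline testing.
# Uses OpenAI's TTS endpoint as an example; model and voice names are illustrative.
from openai import OpenAI

client = OpenAI()

phrases = [
    "I need to cancel my order from yesterday",
    "Necesito cancelar mi pedido de ayer",
]

for i, phrase in enumerate(phrases):
    response = client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input=phrase,
    )
    # The response body is the generated audio (MP3 by default)
    with open(f"synthetic_sample_{i}.mp3", "wb") as f:
        f.write(response.content)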

Create language-specific evaluation datasets

Voice agents often need to handle multiple languages or dialects. Don't assume that performance on English transfers to Spanish, Mandarin, or Hindi.

Build evaluation datasets for each target language. Include native speakers when recording real audio samples. Test language detection for multilingual agents. Verify that intent classification works across languages (idioms and cultural context differ).

Braintrust makes it easy to organize evaluation datasets by language, accent, or any other metadata dimension. Tag each example with relevant attributes, then filter and compare performance across segments.
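
Here's a sketch of what that tagging might look like with the Braintrust Python SDK; the project and dataset names are hypothetical:

python
# Sketch: inserting examples tagged with language and accent metadata so
# results can be filtered by segment. Project and dataset names are hypothetical.
import braintrust

dataset = braintrust.init_dataset(project="voice-agent", name="intent-eval-multilingual")

dataset.insert(
    input={"audio_file": "samples/es_cancel_order_01.wav"},
    expected={"intent": "cancel_order"},
    metadata={"language": "es", "accent": "mexican", "noise_level": "low"},
)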

Test both individual components and end-to-end flows

Voice agents function as complete systems with many interconnected components. Test each component in isolation to diagnose failures precisely, and test end-to-end flows to catch integration issues and measure real user experience.

For component-level testing, evaluate STT accuracy on recorded audio, test intent classification on transcribed text, verify response generation logic with mocked inputs, and measure TTS quality independently.

For end-to-end testing, simulate complete user conversations, measure task completion across full dialogues, test conversation state management, and verify handling of interruptions and corrections.

Braintrust traces capture every step in your agent's execution. When end-to-end performance degrades, you can drill into individual spans to identify whether the issue was STT, intent classification, or response generation. This visibility accelerates debugging and iteration.
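
One way to get that span-level visibility is to give each pipeline stage its own span. Here's a sketch using the traced decorator from the Braintrust SDK, with placeholder stage implementations and a hypothetical project name:

python
# Sketch: giving each pipeline stage its own span so end-to-end traces can be
# drilled into per component. Stage bodies are placeholders; the project name
# is hypothetical.
from braintrust import init_logger, traced

logger = init_logger(project="voice-agent")

@traced
def transcribe(audio_bytes: bytes) -> str:
    ...  # call your STT service here

@traced
def classify_intent(transcript: str) -> dict:
    ...  # call your NLU model here

@traced
def generate_response(intent: dict, history: list) -> str:
    ...  # call your LLM here

@traced
def handle_turn(audio_bytes: bytes, history: list) -> str:
    transcript = transcribe(audio_bytes)
    intent = classify_intent(transcript)
    return generate_response(intent, history)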

Implement continuous monitoring in production

Offline evaluation catches many issues, but production reveals problems you didn't anticipate. Users behave differently than test scenarios predict. Unexpected accents appear. Network conditions vary. Real-world chaos breaks assumptions.

Monitor key metrics in production: STT confidence scores (flag low-confidence transcriptions), intent classification confidence, response latency (track p50, p95, p99), task completion rate, and frequency of escalation to human support.

Set up alerts for degradation. If STT accuracy drops suddenly, your speech recognition service might have issues. If intent classification confidence decreases, your users might be asking about topics outside your training data. If latency spikes, your infrastructure might be overloaded.

Braintrust supports online evaluation with continuous scoring in production. Define scorers that measure quality in real-time, sample requests based on traffic volume, and automatically flag low-scoring conversations for offline analysis.

Use attachments to log and debug audio

Audio contains crucial context that text-based logs cannot capture. To truly understand what happened, you need to hear the original audio.

Braintrust attachments let you log raw audio alongside transcriptions and metadata. When reviewing evaluation results or debugging production issues, you can listen to the actual audio, compare against transcriptions, and identify STT failures that text alone wouldn't reveal.

The voice agent cookbook shows how to create audio attachments and log them with traces:

python
from braintrust import Attachment, current_span

# Raw audio captured from the call (e.g. read from a WAV file)
audio_bytes = open("customer_audio.wav", "rb").read()

# Log the audio as an attachment on the current span
attachment = Attachment(
    data=audio_bytes,
    filename="customer_audio.wav",
    content_type="audio/wav",
)
current_span().log(input={"audio_attachment": attachment})

This simple pattern makes your voice agent evaluations transparent and debuggable. Instead of staring at transcripts wondering what went wrong, you can listen to the audio and see exactly where your pipeline failed.

Practical evaluation strategies

Start with golden datasets

Before building complex evaluation pipelines, create a small set of high-quality examples that represent your most important use cases. These "golden datasets" serve as regression tests and calibration tools.

Include successful conversations that should always work, known failure cases you've fixed, edge cases that stress specific components, and adversarial examples that might break your agent.

Run every change against your golden dataset. If a prompt modification breaks a previously working case, you'll know immediately. If a new model improves golden dataset performance, that's a strong signal for deployment.
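
A golden-dataset check can run as a standard offline eval. This sketch assumes a hypothetical run_agent entry point for your agent and a couple of placeholder examples:

python
# Sketch: running a golden dataset as a regression check with Braintrust Eval.
# `run_agent` is a hypothetical entry point; the examples are placeholders.
from braintrust import Eval

golden_examples = [
    {"input": "I need to cancel my order", "expected": "cancel_order"},
    {"input": "What time do you open tomorrow?", "expected": "store_hours"},
]

def run_agent(transcript: str) -> str:
    ...  # return the intent your agent resolves for this transcript

def exact_intent_match(input, output, expected):
    return 1.0 if output == expected else 0.0

Eval(
    "voice-agent-golden",  # project name (hypothetical)
    data=lambda: golden_examples,
    task=run_agent,
    scores=[exact_intent_match],
)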

Evaluate prompts in context of full conversations

Voice agents often use prompts to guide response generation. But evaluating prompts in isolation misses critical context. A great prompt for turn 1 might fail at turn 5 when conversation history grows.

Test prompts with realistic conversation state. Include prior turns in your evaluation setup. Measure how performance degrades as conversation length increases. Verify that prompts handle context switching appropriately.

Braintrust playgrounds let you test prompt variations side-by-side on real conversation traces. Load a conversation from production, modify the system prompt, and see exactly how responses change. This rapid iteration cycle turns prompt engineering from guesswork into measurement.

Use LLM-as-a-judge for subjective qualities

Some evaluation criteria resist simple automation. Conversational naturalness, empathy, and appropriateness require human-like judgment. Building manual review processes for every evaluation run doesn't scale.

LLM-as-a-judge scorers provide scalable evaluation for subjective qualities. Define clear rubrics for what "good" means. Give the judge model examples of excellent versus poor responses. Use chain-of-thought to understand scoring decisions.

For voice agents, you might create scorers that evaluate conversational tone ("Is this response friendly and professional?"), contextual appropriateness ("Does this response acknowledge what the user just said?"), or information completeness ("Does this response fully address the user's question?").
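
As a sketch of what one of these scorers might look like, the autoevals library's LLMClassifier helper wraps a rubric in an LLM call; the rubric wording here is illustrative:

python
# Sketch: an LLM-as-a-judge scorer for conversational tone, built with the
# autoevals LLMClassifier helper. The rubric wording is illustrative.
from autoevals import LLMClassifier

tone_scorer = LLMClassifier(
    name="ConversationalTone",
    prompt_template=(
        "You are reviewing a voice agent's reply.\n"
        "User said: {{input}}\n"
        "Agent replied: {{output}}\n\n"
        "Is the reply friendly, professional, and natural to hear spoken aloud?\n"
        "Answer Y or N."
    ),
    choice_scores={"Y": 1, "N": 0},
    use_cot=True,  # chain-of-thought to understand scoring decisions
)

result = tone_scorer(
    output="No problem, I can take care of that cancellation for you right now.",
    input="I need to cancel my order from yesterday",
)
print(result.score)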

Braintrust supports custom scorers built directly in the platform. Define your rubric, test it on sample data, and apply it across evaluation runs. As your product evolves, refine your scorers to reflect new quality criteria.

Measure real user satisfaction when possible

The ultimate evaluation signal is whether users are satisfied. When feasible, collect explicit feedback: post-call surveys, thumbs up/down buttons, or follow-up satisfaction scores.

Combine explicit feedback with implicit signals. Task completion indicates satisfaction. Immediate escalation to human support suggests failure. Short conversations with resolution indicate efficiency.

Tag production traces with satisfaction signals. Feed dissatisfied conversations back into offline evaluation datasets. This creates a continuous improvement loop: production failures become test cases, test cases drive improvements, improvements ship to production.

Evaluating voice agents with Braintrust

Braintrust provides purpose-built infrastructure for voice agent evaluation that spans the full development lifecycle.

Production logging and trace capture

Every voice agent interaction generates a trace capturing STT output, intent classification, tool calls, response generation, and TTS input. Braintrust logs these traces automatically when you wrap your API clients.
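
Instrumenting that usually takes a couple of lines at client setup. A sketch assuming an OpenAI client and a hypothetical project name:

python
# Sketch: wrapping an OpenAI client so its calls are logged as traces.
# The project name is hypothetical.
from braintrust import init_logger, wrap_openai
from openai import OpenAI

logger = init_logger(project="voice-agent")
client = wrap_openai(OpenAI())

# Subsequent client calls (chat, audio, etc.) are captured with timing and payloads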

Traces include timing information for latency analysis, confidence scores for quality monitoring, full conversation context, and support for audio attachments.

This visibility makes debugging production issues straightforward. When a user reports a bad interaction, search for their conversation, inspect the trace, and see exactly where your pipeline failed.

Curated datasets from production

The best evaluation data comes from real user interactions. Braintrust makes it trivial to convert production traces into evaluation datasets.

Spotted an interesting failure? Click to add it to a dataset. Want to test a fix? Filter for similar cases and create a focused dataset. Need to evaluate a new feature? Tag relevant production examples and build a dataset in minutes.

This tight coupling between production and evaluation ensures your tests reflect reality. Your evaluations target actual user behavior rather than synthetic scenarios you think matter.

Offline and online evaluation

Offline evaluation catches regressions before deployment. Define your data, task, and scorers. Run experiments to compare prompt variations, model changes, or architectural updates. Review results in detail before shipping changes.

Online evaluation monitors production performance. Define scorers that run on live requests. Sample based on traffic volume and score distribution. Alert when quality degrades. Feed low-scoring examples back into offline datasets.

Both evaluation modes use the same infrastructure. Write your scorers once, apply them offline and online. This eliminates the gap between testing and monitoring.

Experiment comparison and regression detection

Voice agent development involves constant iteration: new prompts, updated models, architectural changes. Knowing whether changes improved quality requires precise measurement.

Braintrust experiment comparisons show exactly what changed. Compare score distributions across experiments. Drill into specific test cases that regressed or improved. Filter by metadata to understand performance across segments (languages, intent types, conversation lengths).

Automated regression detection flags cases that got worse. Before deploying a change, review regressions to ensure they're acceptable tradeoffs. This prevents silent quality degradation.

Loop for automated evaluation development

Building comprehensive evaluation suites takes time. Writing scorers, generating datasets, and analyzing results are iterative processes.

Loop, Braintrust's built-in AI agent, automates the time-intensive parts. Loop can generate evaluation datasets tailored to voice agent use cases, suggest scorers based on your quality criteria, analyze production logs to identify failure patterns, and optimize prompts by understanding current performance and suggesting improvements.

Loop integrates directly into Braintrust with full context on your logs, datasets, experiments, and prompts. This integration enables Loop to suggest improvements based on actual production data rather than generic advice.

Case study: Building a multilingual support agent

Consider a customer support voice agent that handles inquiries across five languages. Evaluation needs to verify speech recognition accuracy per language, intent classification across language-specific phrasings, response quality and cultural appropriateness, and task completion regardless of language.

Start by creating language-specific datasets. Record native speakers asking common support questions. Include various accents within each language. Add background noise at different levels to test robustness.

Define scorers that measure language-specific quality. Use exact match for intent classification (did we identify the right category?). Use LLM-as-a-judge for response quality (is this culturally appropriate? Does it answer the question?). Track latency to ensure reasonable response times.

Run experiments comparing STT services, intent models, and prompt variations. Group results by language to identify where performance varies. If Spanish intent classification lags English, investigate whether training data is sufficient or if phrasings differ more than expected.

Deploy with online monitoring. Track STT confidence per language. Flag low-confidence transcriptions for human review. Measure task completion and escalation rates by language. Feed problematic conversations back into evaluation datasets.

The Braintrust voice agent cookbook provides a complete implementation of multilingual language classification evaluation, showing how to generate synthetic audio, classify with GPT-4o audio models, and analyze results grouped by language.

Common pitfalls and how to avoid them

Testing only happy paths

It's tempting to focus evaluation on ideal scenarios: clear audio, standard accents, simple requests. But production is messy: users have strong accents, call from noisy environments, and ask ambiguous questions.

Build evaluation datasets that stress your system. Include non-native speakers, background noise, rapid speech, and ambiguous requests. Test failure modes: what happens when STT fails completely? When intent classification is uncertain? When users provide contradictory information?

Robust voice agents handle edge cases gracefully. Your evaluation should verify this.

Ignoring latency until production

Latency problems are expensive to fix after deployment. Infrastructure changes, model swaps, and architectural updates take time. Catching latency issues during development is much cheaper.

Include latency in every evaluation run. Set acceptable thresholds and fail tests that exceed them. Profile your pipeline to identify bottlenecks. Test under realistic load, not just single-request scenarios.

Voice agent latency has hard constraints. Users won't tolerate 5-second delays. Build this requirement into your evaluation from day one.

Evaluating components in isolation only

Component-level tests are necessary but insufficient. STT might be perfect on clean audio, intent classification might work on transcripts, and response generation might produce good text, yet the integrated system might still fail.

Run end-to-end evaluations that exercise the full pipeline. Simulate realistic conversations with multiple turns. Test conversation state management. Verify that context flows correctly through the system.

Braintrust traces make end-to-end evaluation transparent. Every span is visible and evaluable. When end-to-end performance degrades, drill into specific components to diagnose root causes.

Neglecting data diversity

If your evaluation data comes from one demographic, accent, or use case, you're only testing a narrow slice of reality. Production will reveal gaps.

Actively seek diverse evaluation data. Record samples from different regions, age groups, and native languages. Include technical users and non-technical users. Cover peak hours and off-peak hours (when background noise might differ).

Tag data with demographic and context attributes. Run experiments segmented by these dimensions. If performance degrades for older users or non-native speakers, you'll catch it before deployment.

Continuous improvement through evaluation

Voice agent quality isn't a one-time achievement. User expectations evolve. New use cases emerge. Speech recognition technology improves. Staying competitive requires systematic iteration.

Build evaluation into your development workflow. Run offline evals on every PR. Review experiment results before deploying changes. Monitor production metrics continuously. Feed production failures back into offline evaluation datasets.

This continuous loop compounds improvements over time. Production traces reveal real-world failures. Offline evaluations catch regressions. Experiments compare alternatives systematically. Each iteration makes your agent measurably better.

Getting started with voice agent evaluation

Start small and build systematically. Begin by logging production traces with full context and conversation state. Curate a dataset of 50-100 real conversations representing your most important use cases. Define scorers for your core quality metrics (STT accuracy, intent classification, task completion).

Run your first offline evaluation comparing your current prompt against a variation. Deploy the winner with online monitoring. Feed low-scoring conversations back into your dataset. Repeat the cycle.

As your evaluation infrastructure matures, expand coverage. Add more scorers. Build language-specific datasets. Test edge cases. Automate regression detection. Focus on systematic, measurable improvement over time rather than perfection on day one.

Frequently asked questions

How do I evaluate voice AI agents?

Voice agent evaluation requires measuring multiple components including speech recognition accuracy, intent classification, response quality, latency, and end-to-end task completion. Build evaluation datasets from real user audio, define scorers for each quality dimension, and run both offline experiments and online monitoring.

What metrics should I track for voice agents?

Key metrics include word error rate (STT accuracy), intent classification accuracy, response latency, task completion rate, user satisfaction, and frequency of escalation to human support. Track these metrics both offline during development and online in production.

How do I test voice agents across different languages and accents?

Create language-specific evaluation datasets with native speakers representing various accents. Tag data with language and demographic attributes. Run experiments segmented by these dimensions. Monitor production performance grouped by language to identify gaps.

How can I debug voice agent failures?

Use traces to capture every step in the voice agent pipeline. Log audio attachments alongside transcriptions and responses. When failures occur, replay the trace to see exactly where the pipeline broke: STT, intent classification, response generation, or TTS.

What's the difference between offline and online voice agent evaluation?

Offline evaluation tests changes before deployment using curated datasets and systematic comparisons. Online evaluation monitors live production traffic with continuous scoring and automated alerting. Both are necessary for comprehensive quality assurance.

How do I measure conversational quality?

Use LLM-as-a-judge scorers with clear rubrics defining what makes a good conversation. Evaluate naturalness, contextual appropriateness, empathy, and information completeness. Supplement automated scoring with human review for subjective qualities.

How often should I run voice agent evaluations?

Run offline evaluations on every code change before deployment. Monitor production continuously with online evaluation. Review results weekly to identify trends. Feed production failures back into offline datasets for continuous improvement.