How to evaluate your agent with Gemini 3
Google recently announced Gemini 3, its latest model family, with significant improvements in reasoning, tool use, and multimodal capabilities. Like every major model release, it arrived with impressive benchmarks. But benchmarks rarely reflect real-world performance, especially for agent applications where models need to plan, use tools correctly, and handle complex multi-step workflows.
When a new model launches, the question facing every AI team is the same: Should we upgrade? Will this actually improve our agent's performance, or will it introduce regressions we'll only discover in production? The answer requires systematic evaluation using your actual data, not generic benchmarks.
The agent evaluation challenge
Agent applications differ fundamentally from simple LLM calls. Agents make decisions across multiple steps: planning what to do, choosing which tools to use, constructing arguments correctly, processing results, and synthesizing final answers. Each step introduces potential failure points. A model might excel at general reasoning benchmarks but struggle with the specific tool schemas your application requires.
This complexity means you cannot evaluate agent models the same way you evaluate simple completion tasks. You need to measure both end-to-end performance and individual step quality. Did the agent choose the right tool? Were the parameters correct? Did it handle errors gracefully? These questions require detailed visibility into agent execution.
Establish your baseline
Before testing Gemini 3 or any new model, you need to understand your current agent's performance. Without a baseline, you cannot measure if changes represent improvements or regressions.
Start by running an evaluation with your existing model. The basic structure requires three components (see the sketch after this list):
- Data: Test cases representing real scenarios your agent handles
- Task: Your agent code that processes inputs
- Scorers: Functions that measure quality
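To make that concrete, here is a minimal sketch of how those three components map onto an eval harness, assuming the Braintrust Python SDK and the autoevals library. The project name, test cases, and `run_agent` stub are placeholders for your own agent code, not a prescribed setup.

```python
from braintrust import Eval
from autoevals import Factuality


def run_agent(input: str) -> str:
    # Placeholder task: call your planner, tools, and synthesis steps here
    # and return the agent's final answer.
    return f"agent answer for: {input}"


Eval(
    "my-agent-project",  # hypothetical project name
    # Data: test cases representing real scenarios your agent handles
    data=lambda: [
        {"input": "Cancel order #1234", "expected": "Order #1234 has been cancelled."},
        {"input": "When will my refund arrive?", "expected": "Refunds post within 5 business days."},
    ],
    # Task: the agent code that processes each input
    task=run_agent,
    # Scorers: functions that measure quality of the output
    scores=[Factuality],
)
```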
For agent evaluation, your scorers should measure multiple dimensions (example scorers follow this list):
- Check if the agent selected appropriate tools for each scenario
- Validate that tool arguments match expected schemas
- Measure factual accuracy of final responses
- Track whether the agent completed tasks successfully without errors
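Tool-level checks usually end up as small custom scorer functions. The sketch below assumes your task returns the tool calls the agent made alongside its final answer; the field names and required-argument schema are illustrative, not a fixed format.

```python
# Did the agent select the expected tool for this scenario?
# Assumes each dataset row's `expected` carries an "expected_tool" field and the
# task output includes a "tool_calls" list of {"name": ..., "arguments": {...}}.
def tool_selection(input, output, expected, **kwargs):
    called = {call["name"] for call in output.get("tool_calls", [])}
    return 1.0 if expected["expected_tool"] in called else 0.0


# Do the tool arguments include everything each tool's schema requires?
def arguments_valid(input, output, expected, **kwargs):
    required = {
        "lookup_order": {"order_id"},
        "issue_refund": {"order_id", "amount"},
    }
    for call in output.get("tool_calls", []):
        missing = required.get(call["name"], set()) - set(call.get("arguments", {}))
        if missing:
            return 0.0  # at least one required argument is absent
    return 1.0
```

Both functions return a score between 0 and 1, so they can sit in the same `scores` list as off-the-shelf scorers.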
Once you run this baseline evaluation, you have concrete metrics. These numbers become your reference point for comparing new models.
Use production data for testing
Generic benchmarks tell you how models perform on academic tasks. Production data tells you how they perform on your actual use cases. The gap between these two can be substantial.
To evaluate Gemini 3 against real scenarios, pull logs from your production agent. Focus on cases where your current model underperforms. Low-scoring traces, failed tool calls, and user-reported issues all make excellent test cases. These represent the problems you actively need to solve.
Braintrust captures every agent trace in production, showing the full decision tree: which tools the agent considered, what parameters it constructed, how it processed results. You can add any production trace to a dataset with one click, building test suites directly from real usage patterns.
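If you prefer to build the dataset in code rather than through the UI, the same idea looks roughly like this with the Python SDK; the project and dataset names are made up, and the rows stand in for whatever traces you flagged.

```python
import braintrust

# Hypothetical project and dataset names; the rows come from traces you flagged
# in production (low scores, failed tool calls, user-reported issues).
dataset = braintrust.init_dataset(project="my-agent-project", name="model-upgrade-suite")

flagged_cases = [
    {
        "input": "Refund my last two orders",
        "expected": "Both refunds issued",
        "metadata": {"source": "production", "failure": "wrong tool selected"},
    },
]

for case in flagged_cases:
    dataset.insert(
        input=case["input"],
        expected=case["expected"],
        metadata=case["metadata"],  # tags make it easy to group results later
    )

dataset.flush()  # make sure all rows are written before the eval runs
```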
This approach surfaces issues that benchmarks miss. Your agent might need to work with specific APIs, handle domain-specific terminology, or follow particular formatting requirements. Testing with production data validates that Gemini 3 handles these nuances correctly.
Compare models systematically
With your baseline established and production dataset ready, testing Gemini 3 becomes straightforward. If you use the Braintrust AI proxy for your agent's LLM calls, switching models requires changing one parameter.
Run the same evaluation tasks against both your current model and Gemini 3. Compare results across multiple dimensions: overall success rate, tool selection accuracy, response quality scores, latency, and cost. The evaluation UI shows these metrics side by side, making it easy to spot improvements and regressions.
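In practice, the comparison can be as small as parameterizing the model name. A rough sketch, assuming the proxy's OpenAI-compatible endpoint and placeholder model identifiers; a real agent task would include your tool definitions and multi-step loop rather than a single chat call:

```python
import os

from braintrust import Eval, init_dataset
from openai import OpenAI
from autoevals import Factuality

# The Braintrust AI proxy exposes an OpenAI-compatible endpoint, so the same
# client can reach different providers by changing only the model string.
client = OpenAI(
    base_url="https://api.braintrust.dev/v1/proxy",
    api_key=os.environ["BRAINTRUST_API_KEY"],
)


def make_task(model: str):
    def run_agent(input: str) -> str:
        # Simplified single-turn call; swap in your real agent loop here.
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": input}],
        )
        return response.choices[0].message.content

    return run_agent


# Run the same data and scorers against both models; only the model changes.
for model in ["<current-model-id>", "<gemini-3-model-id>"]:  # placeholder IDs
    Eval(
        "my-agent-project",
        data=init_dataset(project="my-agent-project", name="model-upgrade-suite"),
        task=make_task(model),
        scores=[Factuality],  # plus your custom tool scorers
        experiment_name=f"agent-{model}",
    )
```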
Pay particular attention to tool usage patterns. Newer models like Gemini 3 often show different behaviors around tool calling. They might choose tools more or less aggressively, construct parameters differently, or handle tool errors with different strategies. These behavioral differences can significantly impact agent reliability even when overall accuracy improves.
Group your results by specific scenarios or tool types. A model might improve performance on retrieval tasks while regressing on calculation tasks. This granular view helps you make informed decisions about whether the overall trade-offs benefit your application.
Deploy with confidence
If evaluation shows Gemini 3 outperforms your current model without introducing critical regressions, you can deploy the change confidently. The systematic testing means you are not guessing whether the new model will work better. You have measured proof using your actual data and evaluation criteria.
Deployment through the AI proxy requires updating one line of code. The same infrastructure that made testing easy makes production deployment equally straightforward. You do not need to refactor integrations or modify your agent logic.
After deployment, continue monitoring agent performance in production. Enable online evaluation to score real requests in real time. Track the same metrics you validated in offline testing: tool selection accuracy, response quality, completion rates. This confirms that production behavior matches your evaluation results.
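On the logging side, here is a minimal sketch of instrumenting production requests so online scoring has traces to work with, assuming the Braintrust Python SDK's logger and OpenAI wrapper (project and model names are placeholders):

```python
import os

from braintrust import init_logger, traced, wrap_openai
from openai import OpenAI

# Send production traces to Braintrust so online evaluation can score them.
init_logger(project="my-agent-project")  # hypothetical project name

# wrap_openai records every LLM call (inputs, outputs, usage, latency) on the trace.
client = wrap_openai(
    OpenAI(
        base_url="https://api.braintrust.dev/v1/proxy",
        api_key=os.environ["BRAINTRUST_API_KEY"],
    )
)


@traced  # logs this function's input and output as the root span of the trace
def handle_request(user_input: str) -> str:
    response = client.chat.completions.create(
        model="<gemini-3-model-id>",  # placeholder model identifier
        messages=[{"role": "user", "content": user_input}],
    )
    return response.choices[0].message.content
```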
If you notice quality degradation or unexpected patterns, production traces feed directly back into your dataset. Add problematic cases to your test suite, validate fixes in evaluation, then deploy updates. The cycle becomes continuous: production informs evaluation, evaluation validates improvements, improvements ship to production.
Monitor and iterate
New model deployments are not one-time switches. They are ongoing processes requiring continuous measurement. After switching to Gemini 3, use production monitoring to track performance over time.
The monitoring dashboard shows trends across key metrics. Group results by model to compare Gemini 3 against your previous baseline. Narrow the time range to focus on the period after your model change. This reveals whether the performance improvements you measured in evaluation hold up in production at scale.
Watch for patterns in low-scoring traces. These might indicate edge cases your evaluation dataset missed, scenarios where Gemini 3 exhibits unexpected behavior, or areas where additional prompt tuning could help. Add these cases to your dataset so future evaluations catch similar issues.
As your agent evolves and handles new use cases, your evaluation criteria should evolve too. Add new scorers for emerging requirements. Expand datasets to cover new scenarios. The infrastructure that validated Gemini 3 becomes the foundation for validating every future change.
Beyond Gemini 3
This evaluation approach applies to every new model release, not just Gemini 3. When Claude Sonnet 5 or other frontier models launch, you will have a repeatable process: establish a baseline with your current model, test the new model against production data, compare results systematically, deploy if improvements outweigh regressions, and monitor production continuously.
The teams shipping AI agents fastest are those who treat model evaluation like software testing: systematic, automated, and continuous. They do not manually test every new model release. They run their evaluation suite, review results, make data-driven decisions, and move on to building features.
Braintrust connects this complete workflow. Production traces become test datasets. Evaluations validate changes before deployment. The AI proxy makes model swapping trivial. Production monitoring confirms results match expectations. Each piece reinforces the others, creating a development loop where you ship improvements confidently without breaking existing functionality.
When the next major model launches, you will not need to guess if it improves your agent. You will measure it with your data, compare it against your criteria, and decide based on evidence. That is how the best AI teams operate, and it is the infrastructure Braintrust provides.
Start evaluating your agent
Braintrust makes it easy to test new models quickly and see whether they improve your agent. See how scores change with different models and experiment in the playground. Sign up for Braintrust and run your first agent evaluation in under an hour to compare Gemini 3 against your baseline with real production data.
FAQs
How do I evaluate my AI agent when a new model like Gemini 3 launches?
Start by establishing a baseline with your current model using production data. Create an evaluation dataset from real agent traces that represent typical scenarios and edge cases. Run evaluations measuring both end-to-end task completion and individual step quality (tool selection accuracy, parameter correctness, response quality). Then test Gemini 3 using the same dataset and scorers, comparing results systematically across metrics like success rate, tool usage patterns, latency, and cost before deciding whether to deploy.
What makes agent evaluation different from regular LLM evaluation?
Agent evaluation requires measuring multi-step workflows where models make sequential decisions: planning, tool selection, argument construction, result processing, and answer synthesis. Unlike simple prompt completion, agents can fail at any step in this chain. You need visibility into each decision point to understand whether the agent chose appropriate tools, constructed valid parameters, handled errors correctly, and synthesized accurate final responses. This requires detailed tracing and step-level scoring in addition to end-to-end metrics.
Can I test Gemini 3 without changing my production code?
Yes. If you use the Braintrust AI proxy for your agent's LLM calls, testing Gemini 3 requires changing only the model parameter in your evaluation code. The proxy provides a standardized OpenAI-compatible interface, so you can swap between Gemini 3, GPT models, Claude, and other providers without modifying your agent logic. This makes A/B testing new models straightforward and lets you compare performance systematically before updating production.
How do I know if Gemini 3 is better for my specific agent use case?
Generic benchmarks tell you how models perform on academic tasks, but only your production data reveals performance on your actual use cases. Pull real agent traces from production, especially ones where your current model underperforms. Test Gemini 3 against these scenarios using your specific evaluation criteria: Does it choose the right tools for your application? Does it handle your particular APIs and data schemas correctly? Does it maintain quality on your domain-specific requirements? The results show whether Gemini 3 improves your agent, not agents in general.
What metrics should I track when evaluating agents with new models?
Track both outcome metrics and process metrics. Outcome metrics include task completion rate, final response quality scores (factuality, helpfulness, accuracy), user satisfaction, and business KPIs. Process metrics include tool selection accuracy (did the agent choose appropriate tools), parameter correctness (were tool arguments valid), step efficiency (unnecessary steps or retries), error handling quality, latency per step, and cost per request. Comparing both categories across models reveals whether improvements in one area come at the cost of regressions in another.
How can I use production data to build better agent evaluations?
Production traces capture real user interactions, edge cases, and failure patterns that synthetic test data misses. Braintrust logs every agent execution with full visibility into decision trees, tool calls, and intermediate steps. Filter for low-scoring traces, failed tool invocations, or user-reported issues and add them directly to evaluation datasets. This ensures your test suite reflects actual usage patterns. As you collect more production data, continuously expand your evaluation datasets so testing stays aligned with how users actually interact with your agent.
Should I evaluate Gemini 3 on all tasks or focus on specific capabilities?
Start by focusing on areas where your current model underperforms or where Gemini 3 claims specific improvements. If Google highlights enhanced reasoning or better tool use in Gemini 3, prioritize evaluating complex multi-step tasks and tool-heavy workflows. Then run comprehensive evaluations on your full test suite to catch unexpected regressions in areas where your current model already performs well. Group results by task type (retrieval, calculation, generation, etc.) to understand where Gemini 3 excels and where it might regress compared to your baseline.
How do I evaluate Gemini 3 tool calling accuracy for my agent?
Create evaluation cases that cover your agent's tool usage patterns: common tools the agent should use frequently, edge case scenarios requiring less common tools, ambiguous situations where multiple tools could apply, and cases where no tool is appropriate. Score whether Gemini 3 selected the correct tool, whether it constructed valid arguments matching your tool schemas, whether it handled tool errors appropriately, and whether it correctly processed tool results. Compare these metrics against your current model to see if Gemini 3 improves tool usage reliability.
What is the fastest way to compare Gemini 3 against my current agent model?
Use Braintrust's evaluation framework with the AI proxy. Pull a representative dataset from production traces (start with 50-100 examples covering common scenarios and known failure cases). Run your agent evaluation against both your current model and Gemini 3 in parallel by changing only the model parameter. The evaluation UI shows side-by-side comparisons across all metrics: success rates, quality scores, latency, cost, and detailed trace-level results. Most teams complete initial comparisons in under an hour, getting immediate clarity on whether Gemini 3 improves their specific use case.
How do I handle regression testing when switching to Gemini 3?
Build a comprehensive evaluation dataset that covers both positive cases (scenarios your current model handles well) and negative cases (known failure patterns). When testing Gemini 3, compare results across both categories. Improvements on failure cases are good, but watch for regressions on previously successful scenarios. Set quality thresholds that new models must meet: overall success rate cannot drop below baseline, critical tool selections must maintain accuracy, response quality on core use cases must stay consistent. Only deploy Gemini 3 if it improves problem areas without breaking what already works.
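One lightweight way to enforce those thresholds is a small gate that compares the candidate's scores against the baseline before anything ships. A sketch in plain Python, with made-up score values standing in for your actual experiment summaries:

```python
# Hypothetical score summaries from the baseline and candidate evaluation runs.
baseline = {"task_success": 0.86, "tool_selection": 0.91, "response_quality": 0.83}
candidate = {"task_success": 0.89, "tool_selection": 0.90, "response_quality": 0.85}

# Per-metric tolerance: how much regression, if any, is acceptable.
tolerances = {"task_success": 0.0, "tool_selection": 0.02, "response_quality": 0.02}


def regressions(baseline, candidate, tolerances):
    failed = []
    for metric, base_score in baseline.items():
        allowed_drop = tolerances.get(metric, 0.0)
        if candidate.get(metric, 0.0) < base_score - allowed_drop:
            failed.append(metric)
    return failed


failed = regressions(baseline, candidate, tolerances)
if failed:
    raise SystemExit(f"Model swap blocked; regressions in: {', '.join(failed)}")
print("Candidate meets quality thresholds; safe to promote.")
```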
Can I gradually roll out Gemini 3 to my agent instead of switching completely?
Yes. After offline evaluation validates that Gemini 3 improves performance, you can deploy it gradually using traffic splitting or A/B testing. Route a small percentage of production traffic to Gemini 3 while keeping most users on your current model. Monitor both populations with online evaluation, tracking the same metrics you validated offline. If production results match evaluation predictions, increase the Gemini 3 traffic percentage. If you observe unexpected issues, roll back instantly. This de-risks model changes and provides real-world validation at scale.
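The routing itself can live in your application code when all traffic already goes through the proxy. A hedged sketch, where the rollout fraction and model identifiers are placeholders:

```python
import hashlib

ROLLOUT_PERCENT = 10  # start by sending 10% of traffic to the new model


def pick_model(user_id: str) -> str:
    # Stable hash-based bucketing keeps each user on the same model across requests.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    if bucket < ROLLOUT_PERCENT:
        return "<gemini-3-model-id>"  # placeholder identifier
    return "<current-model-id>"  # placeholder identifier
```

Log the chosen model as trace metadata so monitoring can compare the two populations directly.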
What are the best practices for monitoring agent performance after deploying Gemini 3?
Enable online evaluation to score real production requests continuously. Track the same metrics you used in offline testing: tool selection accuracy, response quality, task completion rates. Set up alerts for quality drops below acceptable thresholds. Group monitoring results by model so you can compare Gemini 3 performance against your previous baseline over time. Watch for drift (performance degrading gradually) and regression patterns (specific scenarios where Gemini 3 consistently underperforms). Feed problematic production traces back into your evaluation dataset to catch similar issues before future deployments.