Key questions for evaluating agents
When evaluating sophisticated agent behaviors, ask questions like:
- If the agent starts by providing a plan of actions to take in answering the user’s query, does that plan make sense given the user’s objective?
- If the agent provides reasoning steps, are those intermediate thoughts expected?
- Did the agent choose the correct next step or defer to a human as expected?
- Did the agent invoke the correct tools?
- When invoking a tool, did the agent properly build up the arguments to invoke it?
- When examining a tool’s output, did the agent properly utilize it to provide an answer or move to the next expected step?
Errors can surface at any point in an agentic system. To debug and understand these errors, it’s important to capture both the inputs and the outputs at each step.
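For example, a lightweight way to capture this is to record each step’s inputs, output, and errors as structured events. The StepRecord and AgentTrace types below are hypothetical stand-ins for whatever tracing your framework provides:

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class StepRecord:
    # One entry per agent step: what went in, what came out, and how long it took.
    name: str
    inputs: dict[str, Any]
    output: Any = None
    error: str | None = None
    duration_s: float = 0.0

@dataclass
class AgentTrace:
    steps: list[StepRecord] = field(default_factory=list)

    def record_step(self, name: str, fn: Callable[..., Any], **inputs: Any) -> Any:
        """Run one step, capturing its inputs and output (or error) for later debugging."""
        record = StepRecord(name=name, inputs=inputs)
        start = time.time()
        try:
            record.output = fn(**inputs)
            return record.output
        except Exception as exc:
            record.error = repr(exc)
            raise
        finally:
            record.duration_s = time.time() - start
            self.steps.append(record)

# Usage: wrap each step so a failure can be traced back to the exact inputs that caused it.
trace = AgentTrace()
plan = trace.record_step("plan", lambda query: f"1. Search the docs for {query!r}", query="reset my password")
answer = trace.record_step("answer", lambda plan: f"Following the plan: {plan}", plan=plan)
```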
Types of evaluations
Offline evaluations
Offline evaluations proactively identify issues in agent behavior before deployment. These function similarly to unit tests or integration tests, emphasizing reproducibility and stability. You can use datasets to test both the end-to-end performance of your agent and its intermediate steps. For instance, you might create a specific dataset to test a retrieval step in a RAG pipeline, or one that checks whether generated SQL adheres to security constraints. Once you’ve created a “golden dataset” with ground truth examples, you can apply either code-based scorers or LLM-as-a-judge scorers to evaluate outputs systematically.
Recommended approach:
- Stub external dependencies: Snapshot sufficient state from production or staging environments to simulate databases, APIs, and infrastructure.
- Isolate specific agent actions: Create deterministic scenarios to evaluate critical behaviors reliably.
- Assess incremental behavior: Evaluate individual agent steps, including tool calls, parameter accuracy, and responses.
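As a sketch of this pattern, the example below replays a small golden dataset against a stubbed retrieval step and scores it with a code-based scorer. The dataset, snapshot index, and scorer are illustrative, not tied to any particular SDK:

```python
# Golden dataset: each case pairs an input with its ground-truth expectation.
GOLDEN_DATASET = [
    {"input": "Which regions support SSO?", "expected_doc_id": "doc-sso-regions"},
    {"input": "How do I rotate an API key?", "expected_doc_id": "doc-api-keys"},
]

# Snapshot of the production index, so the retrieval step is deterministic offline.
SNAPSHOT_INDEX = {
    "sso": "doc-sso-regions",
    "api key": "doc-api-keys",
}

def stub_retrieve(query: str) -> str:
    """Deterministic stand-in for the production retriever."""
    for keyword, doc_id in SNAPSHOT_INDEX.items():
        if keyword in query.lower():
            return doc_id
    return "doc-not-found"

def retrieval_accuracy(case: dict) -> float:
    """Code-based scorer: 1.0 if the stubbed retrieval step returns the expected document."""
    return 1.0 if stub_retrieve(case["input"]) == case["expected_doc_id"] else 0.0

scores = [retrieval_accuracy(case) for case in GOLDEN_DATASET]
print(f"Retrieval accuracy: {sum(scores) / len(scores):.2f}")
```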
Online evaluations
Online evaluations continuously monitor real-time performance, capturing live user interactions and diagnosing issues as they arise. Here, there is no ground truth for evaluating the overall performance of the agent or any of its steps, so in general we rely on LLM-as-a-judge scorers for evaluation.
Recommended approach:
- Real environment usage: Always evaluate in your actual production environment for accurate user experience insights.
- Incorporate user feedback: Allow users to like or dislike agent responses and provide comments. This can be invaluable for error analysis and for informed sampling of traces to evaluate. Refer to the user feedback docs for implementation details.
- Real-time scoring: Implement continuous monitoring for key behaviors like hallucinations, tool accuracy, and goal completion. More information is available in the online scoring documentation.
- Adaptive sampling: Start by scoring all requests, then adjust sample rates based on agent stability and usage volume. For details on how to control sampling from your logs, check out the online scoring docs.
- Feedback integration: Use both low-scoring and anomalously high-scoring examples to feed new test scenarios into offline evaluations.
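A minimal sketch of sampled online scoring is shown below. The call_llm_judge function is a hypothetical placeholder for your LLM-as-a-judge call, and the sample rate and threshold are illustrative:

```python
import random

SAMPLE_RATE = 1.0  # Start by scoring every request, then lower this as the agent stabilizes.

def call_llm_judge(question: str, answer: str) -> float:
    """Hypothetical placeholder: ask a judge model to rate the answer (0-1) for, e.g., groundedness."""
    return 1.0 if answer else 0.0  # Replace with a real model call.

def maybe_score(question: str, answer: str) -> float | None:
    """Score a sampled fraction of live traffic and flag low scorers for review."""
    if random.random() > SAMPLE_RATE:
        return None  # Skipped by sampling.
    score = call_llm_judge(question, answer)
    if score < 0.5:
        # Candidates for error analysis and for new offline test cases.
        print(f"Low score {score:.2f}: queueing for review -> {question!r}")
    return score
```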
Structuring agent evaluations
End-to-end:
- Use real or simulated environments to evaluate complete task flows.
- Focus on goal success, coherence, and robustness.
You can use the hooks argument in your eval’s task function to add intermediate results to your trace’s metadata, which can then be used in any of your eval’s scorers to evaluate the final output, like this:
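(The exact hooks interface depends on your eval framework; the sketch below uses a stand-in EvalHooks class so the metadata-passing pattern is self-contained. Swap it for the real hooks object your SDK passes to the task function.)

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class EvalHooks:
    """Stand-in for the hooks object your eval framework passes to the task function."""
    metadata: dict[str, Any] = field(default_factory=dict)

def task(input: str, hooks: EvalHooks) -> str:
    """End-to-end task: run the agent and stash intermediate results in trace metadata."""
    plan = f"1. Look up {input!r} in the knowledge base"            # agent planning step
    tool_result = {"doc_id": "doc-42", "snippet": "SSO setup steps"}  # agent tool call
    hooks.metadata["plan"] = plan
    hooks.metadata["tool_result"] = tool_result
    return f"Answer based on {tool_result['doc_id']}"

def used_retrieved_doc(output: str, metadata: dict[str, Any]) -> float:
    """Scorer: did the final answer actually use the document the tool step retrieved?"""
    return 1.0 if metadata["tool_result"]["doc_id"] in output else 0.0

hooks = EvalHooks()
output = task("How do I enable SSO?", hooks)
print(used_retrieved_doc(output, hooks.metadata))  # 1.0
```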
Stepwise:
- Use snapshotted scenarios with stubbed infrastructure to test specific decisions in isolation.
- Make sure you include the inputs from the preceding step, since a “step failure” may actually be caused by a problem with the previous step’s output.
- Target precise behaviors, ensuring reproducibility and reliability.
For example, you can write a stepwise scorer that checks whether the agent made the expected tool call:
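(The sketch below is illustrative rather than the cookbook’s code; the tool name and argument schema are made up.)

```python
from typing import Any

def correct_tool_call(output: dict[str, Any], expected: dict[str, Any]) -> float:
    """Stepwise scorer for a single tool call: right tool, right arguments.
    `output` is the call the agent produced; `expected` is the ground-truth call from the dataset."""
    if output.get("tool") != expected["tool"]:
        return 0.0
    expected_args = expected["arguments"]
    actual_args = output.get("arguments", {})
    if not expected_args:
        return 1.0
    matched = sum(1 for key, value in expected_args.items() if actual_args.get(key) == value)
    return matched / len(expected_args)

# Hypothetical golden example: the agent should call get_weather with these arguments.
print(correct_tool_call(
    output={"tool": "get_weather", "arguments": {"city": "Berlin", "unit": "celsius"}},
    expected={"tool": "get_weather", "arguments": {"city": "Berlin", "unit": "celsius"}},
))  # 1.0
```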
To see the full example, check out the API Agent cookbook.
Additional resources
- An agent that runs OpenAPI commands
- Using functions to build a RAG agent
- A field guide to rapidly improving AI products
Designing comprehensive agent evaluations
For agents managing complex, multi-step interactions, make sure evaluations account for variability and context-dependence:
- Snapshotting state: Capture tool calls and responses from live environments for accurate offline evaluation scenarios.
- Incremental assessment: Evaluate each step individually to manage non-deterministic agent interactions effectively.
- Goal-oriented evaluation: For complex sequences, prioritize evaluations based on the agent’s ultimate success or failure in achieving its intended outcome.
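One lightweight way to snapshot state (the file name and record structure below are illustrative) is to record each live tool call and response, then replay the recording in place of the real tool during offline runs:

```python
import json
from pathlib import Path

SNAPSHOT_PATH = Path("tool_snapshots.json")  # Illustrative location for recorded calls.

def record_tool_call(tool: str, args: dict, response: dict) -> None:
    """In production: append each tool call and its response to the snapshot file."""
    snapshots = json.loads(SNAPSHOT_PATH.read_text()) if SNAPSHOT_PATH.exists() else []
    snapshots.append({"tool": tool, "args": args, "response": response})
    SNAPSHOT_PATH.write_text(json.dumps(snapshots, indent=2))

def replay_tool_call(tool: str, args: dict) -> dict:
    """Offline: serve the recorded response instead of calling the real API."""
    for snap in json.loads(SNAPSHOT_PATH.read_text()):
        if snap["tool"] == tool and snap["args"] == args:
            return snap["response"]
    raise KeyError(f"No snapshot recorded for {tool} with {args}")
```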
Evolving your evaluation suite
Evaluations should evolve alongside your agent’s behavior and product goals.
1. Start with simple scenarios, using stubbed environments to isolate key decisions.
2. Add complex flows, using simulated or real data to test agents under realistic conditions.
3. Define custom success criteria. For data-intensive agents (for example, manipulating and loading data into databases), define criteria (see the sketch after this list) like:
   - Schema compliance
   - Data transformation correctness
   - Deterministic output formats
4. Use continuous feedback loops:
   - Iterate on scorers
   - Expand your dataset coverage
   - Adapt to new agent workflows
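For the data-intensive case in step 3, a schema-compliance check can be a plain code-based scorer. The required columns and types below are illustrative:

```python
from typing import Any

# Illustrative target schema for rows the agent loads into a database.
REQUIRED_SCHEMA = {"id": int, "email": str, "signup_date": str}

def schema_compliance(rows: list[dict[str, Any]]) -> float:
    """Fraction of rows that contain every required column with the expected type."""
    if not rows:
        return 0.0

    def row_ok(row: dict[str, Any]) -> bool:
        return all(isinstance(row.get(col), typ) for col, typ in REQUIRED_SCHEMA.items())

    return sum(row_ok(row) for row in rows) / len(rows)

print(schema_compliance([
    {"id": 1, "email": "a@example.com", "signup_date": "2024-01-01"},
    {"id": "2", "email": "b@example.com", "signup_date": "2024-01-02"},  # Wrong type for id.
]))  # 0.5
```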