
Agent
/'ay.juhnt/ (noun)
An AI system that takes actions autonomously by deciding which tools to call, in what order to execute steps, and when to stop. Agent behavior is non-deterministic and often involves multiple model calls per interaction.
Why it matters
Agents introduce eval challenges that single-prompt applications do not have. Because an agent decides which tools to call, in what order, and when to stop, the same input can produce a different execution path on every run. That non-determinism means you cannot simply diff outputs against a fixed expected answer; you need to evaluate the trajectory, not just the final result. Did the agent call the right tools? Did it recover from errors? Did it stop at the right time? You also need tracing that captures every step, so you can debug failures that happen deep in a multi-turn chain. Standard unit tests and snapshot comparisons fall short here because the space of valid behaviors is wide. Teams that build agents successfully tend to combine path-level evals, production tracing, and targeted datasets that cover tool-use combinations, rather than relying on end-to-end pass/fail checks alone.
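To make "evaluate the trajectory" concrete, here is a minimal sketch of a path-level scorer. The trace format (a list of step dicts with `tool` and `error` keys), the `required_tools` parameter, and the step budget are illustrative assumptions, not any specific SDK's schema.

```python
# A minimal sketch of a trajectory-level scorer, independent of any SDK.
# The trace format (a list of step dicts with "tool" and "error" keys)
# is a hypothetical example, not a specific product's schema.

def score_trajectory(trace, required_tools, max_steps=10):
    """Score an agent trace on path-level criteria, not just the final output."""
    tools_called = [step["tool"] for step in trace if step.get("tool")]

    # Did the agent call every tool the task requires?
    called_required = all(t in tools_called for t in required_tools)

    # Did it stop within a reasonable step budget instead of looping?
    stopped_in_time = len(trace) <= max_steps

    # Did it take another step after any failed tool call
    # (a crude proxy for error recovery)?
    error_indices = [i for i, step in enumerate(trace) if step.get("error")]
    recovered = all(i + 1 < len(trace) for i in error_indices)

    checks = [called_required, stopped_in_time, recovered]
    return sum(checks) / len(checks)  # partial credit across path criteria


# Example: a trace matching the usage sentence below.
trace = [
    {"tool": "sql_query", "error": False},
    {"tool": "summarize", "error": False},
]
print(score_trajectory(trace, required_tools=["sql_query", "summarize"]))  # 1.0
```

A scorer like this returns partial credit rather than pass/fail, which makes regressions in one path criterion visible even when the others still hold.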
“The agent decided to query SQL first, then summarize the results for the user.”
Customer example
Notion evolved from simple prompt-and-judge setups to "agentic evaluation," building evals that cover path-finding and combinatorial tool-use behaviors so teams can iterate based on production traces instead of intuition.
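One way to read "combinatorial tool-use behaviors" is as a dataset that enumerates tool pairings an agent might need to chain. The sketch below is a hedged illustration of that idea; the tool names and case structure are hypothetical, not Notion's or any product's actual setup.

```python
# A sketch of a targeted dataset covering tool-use combinations.
# Tool names and the case structure are hypothetical.
from itertools import combinations

TOOLS = ["search", "sql_query", "summarize", "send_email"]

# One eval case per pair of tools the agent might need to chain.
dataset = [
    {"input": f"Task requiring {a} then {b}", "expected_tools": [a, b]}
    for a, b in combinations(TOOLS, 2)
]
print(len(dataset))  # 6 pairwise cases from 4 tools
```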
Related Evaluation terms
- Absolute scoring
- AI eval
- Alignment
- Annotation schema
- Baseline
- Baseline experiment
- Benchmark
- Calibration
- CI/CD integration
- Coherence
- Confidence interval
- Eval harness
- Eval leakage
- Experiment
- Factuality
- Failure mode
- Faithfulness
- Feedback signal
- Groundedness
- Hallucination
- Inter-annotator agreement (IAA)
- LLM-as-a-judge
- Loop
- Model comparison
- Multimodal
- Non-determinism
- Offline evaluation
- Pairwise evaluation
- Pass@k
- Playground
- Quality gate
- RAG (retrieval-augmented generation)
- RAG evaluation
- Reference-based scoring
- Reference-free scoring
- Regression testing
- Release criteria
- Remote evaluation
- Rubric
- Safety
- Score distribution
- Scorer
- Semantic failure
- Signal-to-noise ratio
- Task (eval task)
- Toxicity score