Benchmark
/'behnch.mahrk/ (noun)
A standardized dataset and eval setup used to compare model or system performance over time or across approaches. Benchmarks help track progress, but can be gamed or become less representative of real-world usage.
Why it matters
Public benchmarks like MMLU or HumanEval are useful for comparing models against each other, but they rarely reflect the specific inputs, edge cases, or quality bar that matter for your product. A model that scores well on a generic benchmark can still fail on your domain-specific queries.

The challenge is that building a custom benchmark from scratch is expensive. The practical approach is to start with a small, curated dataset drawn from real production data, define scorers that measure the dimensions you care about, and treat that as your internal benchmark. You can then expand it over time as you discover new failure modes.

Standardized benchmarks still have value as a sanity check when evaluating new models, but your deployment decisions should be driven by performance on your own data. Running your custom benchmark as part of every experiment gives you a stable reference point for measuring whether changes are actually improving the experience for your specific use case.
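In practice, this often amounts to a short eval script that pairs a curated dataset with one or two scorers and is re-run for every experiment. The sketch below assumes the Braintrust Python SDK's Eval entry point; the project name, dataset rows, answer_question task, and contains_expected scorer are illustrative placeholders, not a prescribed setup.

```python
# Minimal sketch of an internal benchmark run as a Braintrust eval.
# Running it requires a Braintrust API key configured in the environment.
from braintrust import Eval

# Small, curated dataset drawn from real production queries (placeholder rows).
dataset = [
    {"input": "How do I reset my API key?", "expected": "Settings > API keys > Rotate"},
    {"input": "Which plans support SSO?", "expected": "Business and Enterprise"},
]

def answer_question(input):
    # Placeholder for the system under test (prompt + model + retrieval, etc.).
    return "SSO is available on Business and Enterprise plans."

def contains_expected(input, output, expected):
    # Simple custom scorer: does the answer contain the expected fact?
    return 1.0 if expected.lower() in output.lower() else 0.0

# Re-running this same eval for every experiment keeps scores comparable over time.
Eval(
    "support-assistant",          # hypothetical project name
    data=lambda: dataset,
    task=answer_question,
    scores=[contains_expected],
)
```

As the benchmark grows, new failure modes discovered in production can be added as rows to the dataset, and additional scorers can be attached without changing the rest of the setup.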
“We're using last quarter's benchmark to see whether the new prompt actually improved accuracy.”
Customer example
Browserbase publishes public Braintrust eval dashboards benchmarking "computer-use" models on real browser tasks via Stagehand; they treat benchmarks as a starting point and encourage teams to add custom evals for their own workflows.
Related Evaluation terms
- Absolute scoring
- Agent
- AI eval
- Alignment
- Annotation schema
- Baseline
- Baseline experiment
- Calibration
- CI/CD integration
- Coherence
- Confidence interval
- Eval harness
- Eval leakage
- Experiment
- Factuality
- Failure mode
- Faithfulness
- Feedback signal
- Groundedness
- Hallucination
- Inter-annotator agreement (IAA)
- LLM-as-a-judge
- Loop
- Model comparison
- Multimodal
- Non-determinism
- Offline evaluation
- Pairwise evaluation
- Pass@k
- Playground
- Quality gate
- RAG (retrieval-augmented generation)
- RAG evaluation
- Reference-based scoring
- Reference-free scoring
- Regression testing
- Release criteria
- Remote evaluation
- Rubric
- Safety
- Score distribution
- Scorer
- Semantic failure
- Signal-to-noise ratio
- Task (eval task)
- Toxicity score
From the docs
Braintrust is the AI observability and eval platform for production AI. By connecting evals and observability in one workflow, teams at Notion, Stripe, Zapier, Vercel, and Ramp ship quality AI products at scale.