Loop
/loop/ (noun)
Braintrust's AI assistant that helps teams write scorers, iterate on prompts, generate datasets, run evals, and surface patterns in production data. Loop reduces the friction of going from "something broke" to "we shipped a measured fix."
Why it matters
Most AI assistants and chatbots operate without context about your specific system, your data, or your eval history. Loop is different: it has access to your production traces, datasets, scorers, and experiment results. That context lets it help you write a scorer by examining actual examples from your logs, suggest prompt improvements based on patterns in your failure cases, or generate dataset records grounded in real production behavior.

The value is in reducing the friction of the eval workflow. Writing a good scorer from scratch requires understanding your data and your quality bar; generating a useful dataset requires knowing what kinds of inputs your system actually sees. An assistant with production context can do both faster than starting from a blank slate. Loop helps teams go from noticing a problem in their traces to shipping a measured fix sooner, by handling the mechanical parts of the iteration loop so you can focus on the judgment calls.
“Loop suggested a scorer to detect answers that contradict the source context.”
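As a rough illustration of the kind of artifact a suggestion like that produces, here is a minimal sketch of a custom scorer wired into a Braintrust eval. This is not Loop's actual output: the project name, dataset row, task, and lexical-overlap heuristic are hypothetical placeholders, and a production contradiction scorer would more likely use an LLM-as-a-judge prompt drafted from real log examples.

```python
# Minimal sketch of a contradiction-style scorer in a Braintrust eval.
# The project name, data, task, and overlap heuristic are hypothetical placeholders.
from braintrust import Eval


def contradiction_risk(input, output, expected=None, **kwargs):
    """Crude lexical check: answers with little overlap with the source
    context are flagged as more likely to contradict or ignore it."""
    context_tokens = set(input["context"].lower().split())
    answer_tokens = set(output.lower().split())
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & context_tokens) / len(answer_tokens)


Eval(
    "loop-scorer-demo",  # hypothetical project name
    data=lambda: [
        {
            "input": {
                "question": "What is the refund window?",
                "context": "Refunds are accepted within 30 days of purchase.",
            },
            "expected": "Refunds are accepted within 30 days of purchase.",
        }
    ],
    # Stand-in task; in practice this calls your model or agent.
    task=lambda input: "You can get a refund within 30 days of purchase.",
    scores=[contradiction_risk],
)
```

The difference with Loop is the starting point: it can propose the heuristic or judge prompt for a scorer like this from actual rows in your logs rather than from an empty file.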
Customer example
Retool uses Loop to move from isolated bug reports to systemic patterns in production traces, and to validate fixes against focused datasets.
Related Evaluation terms
- Absolute scoring
- Agent
- AI eval
- Alignment
- Annotation schema
- Baseline
- Baseline experiment
- Benchmark
- Calibration
- CI/CD integration
- Coherence
- Confidence interval
- Eval harness
- Eval leakage
- Experiment
- Factuality
- Failure mode
- Faithfulness
- Feedback signal
- Groundedness
- Hallucination
- Inter-annotator agreement (IAA)
- LLM-as-a-judge
- Model comparison
- Multimodal
- Non-determinism
- Offline evaluation
- Pairwise evaluation
- Pass@k
- Playground
- Quality gate
- RAG (retrieval-augmented generation)
- RAG evaluation
- Reference-based scoring
- Reference-free scoring
- Regression testing
- Release criteria
- Remote evaluation
- Rubric
- Safety
- Score distribution
- Scorer
- Semantic failure
- Signal-to-noise ratio
- Task (eval task)
- Toxicity score