Mental models
Scorers are a crucial element of both offline and online evaluations:
- Offline evaluations are used to proactively identify and resolve issues before deployment.
- Online evaluation involves running scorers on live requests to diagnose problems, monitor performance, and capture user feedback in real-time.
1
Define clear criteria
Before beginning to write scorers, clearly identify the criteria users will use to evaluate the generated output.
You can start by defining:
- Input: The data or prompt given to the model.
- Output: The expected result from the model.
Then list the criteria that matter for that output, for example:
- Accuracy of information
- Conciseness
- Clarity and readability
- Appropriate tone
- Correct grammar and spelling
- Bias and safety
- Adherence to specific formatting
In more complex, agentic workflows, each step may have its own inputs and outputs. This just means you might have different criteria, and therefore different scorers, for each step (see the sketch below). Braintrust will automatically aggregate scores across spans for each trace.
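For example, a retrieve-then-answer workflow might score its retrieval step and its generation step with different functions. The sketch below is illustrative only: the step names, signatures, and thresholds are assumptions, and each function simply follows the convention of returning a score between 0 and 1.

```python
# Illustrative per-step scorers for a hypothetical retrieve-then-answer workflow.
# The signatures and thresholds are assumptions, not a prescribed Braintrust API.

def retrieval_recall(output: list[str], expected: list[str]) -> float:
    """Score the retrieval step: fraction of expected documents that were retrieved."""
    if not expected:
        return 1.0
    retrieved = set(output)
    return sum(1 for doc in expected if doc in retrieved) / len(expected)

def answer_conciseness(output: str, max_words: int = 100) -> float:
    """Score the generation step: 1 if the answer stays within the word budget."""
    return 1.0 if len(output.split()) <= max_words else 0.0
```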
2
Apply common quality checks
You will certainly have success criteria that are unique to your product and use case, but many evaluation scenarios also benefit from common quality checks. Review this list of common checks and see whether they apply to your use case:
- Relevance: Does the output reflect the source input accurately?
- Readability: Is the language clear and easy to understand?
- Structure and formatting: Does the output follow required formats, such as structured lists or JSON schemas?
- Factuality: Is the provided information correct and verifiable?
- Safety: Is the content free from biased or offensive language?
- Language accuracy: Does the output match the requested language?
3
Automate with code-based checks
Where possible, implement deterministic quality checks through code-based scoring functions. Code-based scorers are reliable and consistent, execute quickly and efficiently, and reduce variability from human or model judgments. Code-based scorers in Braintrust can be written in either TypeScript or Python, via either the UI or the SDK. They return a score between 0 and 1.
Some examples of code-based checks include (see the sketch after this list):
- Verifying valid JSON structure
- Checking text length constraints (for example, less than 100 characters)
- Ensuring outputs match predefined patterns (for example, a bullet-point list of exactly three items)
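A minimal sketch of checks like these, written as plain Python functions that return a score between 0 and 1 (the function names and thresholds are illustrative assumptions, not part of any SDK):

```python
import json

def valid_json(output: str) -> float:
    """Return 1 if the output parses as JSON, 0 otherwise."""
    try:
        json.loads(output)
        return 1.0
    except json.JSONDecodeError:
        return 0.0

def within_length(output: str, max_chars: int = 100) -> float:
    """Return 1 if the output stays under the character limit, 0 otherwise."""
    return 1.0 if len(output) <= max_chars else 0.0

def exactly_three_bullets(output: str) -> float:
    """Return 1 if the output is a bullet-point list of exactly three items."""
    bullets = [line for line in output.splitlines() if line.strip().startswith("- ")]
    return 1.0 if len(bullets) == 3 else 0.0
```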
Schema validation libraries like pydantic or jsonschema are useful for formatting requirements.
4
Develop and align LLM-based scorers
For more subjective and nuanced criteria that code cannot capture, like tone appropriateness or creativity, you can use LLM-based scorers. When building these, it's important to:
- Design judge prompts with explicit instructions, examples of good vs. bad outputs, and a clear scoring rubric
- Use chain of thought to understand why the model is assigning a specific score
- Use more granular scoring when necessary
- Choose the model that is best suited for the evaluation, which may be different from the model used in the task
When you create your LLM-based scorer, you assign each choice in the rubric to a specific score between 0 and 1. Binary scoring is often recommended because it's easier to define and creates less confusion among human reviewers during alignment. However, when you need more nuanced evaluation, clearly explain what each choice score corresponds to.
To calibrate your LLM-based scorer, test it on a small but representative dataset that covers edge cases, different user personas, and a good variety of inputs. Compare the results with human spot checks to make sure they are aligned.
LLMs can also help you generate good scorer prompts.
In Braintrust, you can enable chain of thought (CoT) with a toggle or flag from the UI or SDK, respectively.
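For example, a binary tone judge might look roughly like the sketch below, which uses the LLMClassifier helper from autoevals. The rubric wording, choice labels, and model name are illustrative assumptions; check the current autoevals reference for exact parameter names and defaults.

```python
# Sketch of a binary LLM-as-judge scorer built with autoevals' LLMClassifier.
# The rubric, choice labels, and model name below are illustrative only.
from autoevals import LLMClassifier

tone_scorer = LLMClassifier(
    name="AppropriateTone",
    prompt_template="""You are grading the tone of a customer-support reply.

Reply: {{output}}

Is the tone professional and empathetic?
a) Yes
b) No""",
    choice_scores={"a": 1, "b": 0},  # binary rubric: each choice maps to a score
    use_cot=True,  # enable chain of thought so the judge explains its score
    model="gpt-4o",  # the judge model can differ from the task model
)

result = tone_scorer(output="Thanks for reaching out! Happy to help with that.")
print(result.score)  # 1 or 0
```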
5
Iterate on your initial set of criteria
Scorer development is an ongoing process. After assessing your initial scorers, you should review low-score outputs to identify missing criteria or edge-case behaviors. Based on what you find, you can refine your definitions and add new scorers for uncovered aspects. You can also rerun the calibration step on an expanded example set, and adjust prompts, model providers, or code as needed.
By tightly coupling development, evaluation, and refinement, you can make sure that your scorers stay aligned with evolving product needs and user inputs.
Best practices for scorer design
- Provide clear rationale: When using language-model-based scorers, enable detailed rationale explanations to understand scoring decisions and refine scorer behavior.
- Single-aspect scorers: Create separate scorers for each distinct evaluation aspect, such as accuracy versus style.
- Weighted scoring: Use weighted averages when combining scores, prioritizing critical criteria over less important ones (see the sketch after this list).
- Appropriate scoring scales: Match the scoring scale to evaluation complexity. Use binary scoring (yes/no) for simple checks and multi-point scales for nuanced assessments.
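A minimal sketch of weighted scoring, assuming you already have per-aspect scores between 0 and 1; the aspect names and weights here are illustrative assumptions:

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-aspect scores with a weighted average (weights need not sum to 1)."""
    total_weight = sum(weights.values())
    return sum(scores[name] * weights[name] for name in weights) / total_weight

# Example: factuality is weighted more heavily than style.
combined = weighted_score(
    scores={"factuality": 1.0, "style": 0.5},
    weights={"factuality": 0.7, "style": 0.3},
)
print(combined)  # 0.85
```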
Evaluating agents
When evaluating agents, scorers should assess not only individual responses but also overall agent behavior and performance (a short sketch follows the list):
- Goal completion: Did the agent accomplish the assigned task?
- Efficiency: Did the agent complete the task within acceptable resource or time constraints?
- Interaction quality: Was the interaction coherent, helpful, and aligned with user expectations?
- Error handling: Did the agent handle unexpected situations gracefully and recover effectively?
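For instance, an efficiency check could compare an agent's step count and latency against budgets pulled from trace metadata. The sketch below is purely illustrative; the metadata keys and budgets are assumptions about how your traces are structured, not a Braintrust-defined schema.

```python
# Illustrative efficiency scorer for an agent trace.
# The metadata keys ("num_steps", "duration_s") and budgets are assumptions.
def efficiency(metadata: dict, max_steps: int = 10, max_seconds: float = 30.0) -> float:
    """Return 1 if the agent stayed within both budgets, 0.5 if it exceeded one, 0 if both."""
    within_steps = metadata.get("num_steps", 0) <= max_steps
    within_time = metadata.get("duration_s", 0.0) <= max_seconds
    return (within_steps + within_time) / 2
```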
For more information on evaluating agents, check out the full guide.
Benefits of effective scorers
By following a structured evaluation cycle (define, implement, evaluate, refine), you can:
- Get closer to deterministic model behavior
- Quickly iterate and improve AI features
- Scale evaluations without manual overhead