- Inline in SDK code: Define scorers directly in your evaluation scripts for local development or application-specific logic.
- Pushed via CLI: Define scorers in TypeScript or Python files and push them to Braintrust for team-wide sharing and automatic evaluation of production logs.
- Created in UI: Build scorers in the Braintrust web interface using the built-in code editor.
## Score spans
Span-level scorers evaluate individual operations or outputs. Use them for measuring single LLM responses, checking specific tool calls, or validating individual outputs. Each matching span receives an independent score. Your scorer function receives these parameters:

- `input`: The input to your task
- `output`: The output from your task
- `expected`: The expected output (optional)
- `metadata`: Custom metadata from the test case
Your scorer returns a score and optional metadata.
In Ruby, declare only the parameters you need as keyword arguments. The runner automatically filters out the rest: `|output:, expected:|`.
Use scorers inline in your evaluation code:
equality_scorer.eval.ts
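As a minimal sketch, an inline exact-match scorer might look like the following. The project name, dataset, and task below are hypothetical placeholders; the scorer itself declares only the `output` and `expected` parameters it needs and returns a name plus a score between 0 and 1.

```typescript
import { Eval } from "braintrust";

// Exact-match scorer: receives the task output and the expected value,
// returns 1 when they match and 0 otherwise.
const equality = ({ output, expected }: { output: string; expected?: string }) => ({
  name: "equality",
  score: output === expected ? 1 : 0,
});

Eval("my-project", {
  // Hypothetical placeholder data; swap in your own test cases.
  data: () => [{ input: "2 + 2", expected: "4" }],
  // Hypothetical placeholder task; swap in your real LLM call.
  task: async (input: string) => (input === "2 + 2" ? "4" : "unknown"),
  scores: [equality],
});
```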
## Score traces
Trace-level scorers evaluate entire execution traces, including all spans and conversation history. Use these for assessing multi-turn conversation quality, overall workflow completion, or when your scorer needs access to the full execution context. The scorer runs once per trace. Your handler function receives the `trace` parameter, which provides methods for accessing execution data:
- **Get spans**: Returns spans matching the filter. Each span includes `input`, `output`, `metadata`, `span_id`, and `span_attributes`. Omit the filter to get all spans, or pass multiple types like `["llm", "tool"]`.
  - TypeScript: `trace.getSpans({ spanType: ["llm"] })`
  - Python: `trace.get_spans(span_type=["llm"])`
  - Ruby: `trace.spans(span_type: "llm")`
- **Get thread**: Returns an array of conversation messages extracted from LLM spans.
  - TypeScript: `trace.getThread()`
  - Python: `trace.get_thread()`
  - Ruby: `trace.thread`
`input`, `output`, `expected`, and `metadata` are automatically populated from the root span and passed to your scorer function.
Trace-level scoring requires TypeScript SDK v2.2.1+, Python SDK v0.5.6+, or Ruby SDK v0.2.1+.
Use scorers inline in your evaluation code:
trace_code_scorer.eval.ts
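As a sketch, the handler below uses the `trace.getSpans` and `trace.getThread` methods described above to check that at least one tool span ran and that the conversation ended with a non-empty message. The function name, scoring logic, and loose typing are illustrative; how you register the handler in your eval may differ by SDK version.

```typescript
// A sketch of a trace-level handler. It relies on the trace methods
// documented above; wiring it into an Eval is left to your setup.
async function toolUsageScorer({ trace }: { trace: any }) {
  // All tool spans in the trace (see "Get spans" above).
  const toolSpans = await trace.getSpans({ spanType: ["tool"] });

  // The conversation extracted from LLM spans (see "Get thread" above).
  const thread = await trace.getThread();
  const lastMessage = thread[thread.length - 1];

  return {
    name: "tool_usage",
    score: toolSpans.length > 0 && lastMessage?.content ? 1 : 0,
    metadata: { toolSpanCount: toolSpans.length },
  };
}
```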
## Set pass thresholds
Define minimum acceptable scores to automatically mark results as passing or failing. When configured, scores that meet or exceed the threshold are marked as passing (green highlighting with a checkmark), while scores below it are marked as failing.
Add `__pass_threshold` to the scorer's metadata (a value between 0 and 1):
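For instance, here is a sketch that assumes the threshold lives in the metadata the scorer returns; the scorer name and scoring logic are placeholders.

```typescript
// A sketch: the scorer attaches __pass_threshold in its metadata, so
// results with score >= 0.8 are marked passing.
const relevance = ({ output, expected }: { output: string; expected?: string }) => ({
  name: "relevance",
  // Placeholder scoring logic; replace with your own measure.
  score: output === expected ? 1 : 0.5,
  metadata: { __pass_threshold: 0.8 },
});
```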
## Return multiple scores

A single scorer can return an array of score objects to emit multiple named metrics from one call. This is useful when several quality dimensions can be computed together or share computation. Each item appears as its own score column in the Braintrust UI, and each requires `name` and `score`; `metadata` is optional.
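As a sketch (the metric names and logic below are illustrative):

```typescript
// One scorer returning several named metrics that share computation.
const textQuality = ({ output, expected }: { output: string; expected?: string }) => {
  const exact = output === expected ? 1 : 0;
  return [
    { name: "exact_match", score: exact },
    { name: "non_empty", score: output.trim().length > 0 ? 1 : 0 },
    {
      name: "length_ratio",
      score: expected ? Math.min(output.length / expected.length, 1) : 1,
      metadata: { outputChars: output.length },
    },
  ];
};
```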
## Next steps
- Autoevals for pre-built scorers without writing code
- LLM-as-a-judge for natural language evaluation criteria
- Run evaluations using your scorers
- Score production logs with online scoring rules