Human review is essential for evaluating AI applications. Braintrust integrates feedback from end users, subject matter experts, and product teams, letting you evaluate experiments, assess automated scoring, and curate evaluation datasets.

Configure review scores

Define the scores you want to collect in your project’s Configuration tab. Select Add human review score to configure a new score. Choose from three types:
  • Continuous scores: Numeric values between 0% and 100% with a slider input control. Use for subjective quality assessments like helpfulness or tone.
  • Categorical scores: Predefined options with assigned scores. Each option gets a unique percentage value between 0% and 100% (stored as 0 to 1). Use for classification tasks like sentiment or correctness categories.
  • Free-form text: String values written to the metadata field at a specified path. Use for explanations, corrections, or structured feedback.
Created scores appear in the Human review section of every experiment and log trace in your project.
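As an illustrative sketch (not actual output), this is roughly how the three score types surface on a reviewed span; the score names and metadata path are hypothetical:
// Hypothetical example of reviewed values; "helpfulness", "sentiment", and "review.notes" are made-up names.
const reviewedSpan = {
  scores: {
    helpfulness: 0.8, // continuous: 0–100% slider, stored as 0 to 1
    sentiment: 1, // categorical: the selected option's assigned value
  },
  metadata: {
    review: { notes: "Friendly tone, but misses the refund policy." }, // free-form text at its configured path
  },
};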

Write to expected fields

Configure categorical scores to write to the expected field instead of creating a score. This is useful for labeling ground truth data. To enable:
  1. Check Write to expected field instead of score.
  2. Optionally enable Allow multiple choice for multi-label classification.
Numeric scores are not assigned when writing to expected fields. If an object already exists in the expected field, the categorical value is appended to it.
Categorical scores configured to “write to expected” and free-form scores also appear on dataset rows for labeling. You can always directly edit the structured output for the expected field of any span through the UI.

Review logs and experiments

Select any row to open the trace view and edit the configured human review scores. Scores save automatically and update summary metrics in real time. The process works identically for logs and experiments.

Leave comments

Add comments to spans alongside scores and expected values. Updates are tracked to form an audit trail of edits. Copy links to comments to share with teammates. Comments are searchable using the Filter menu.
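Scores, expected values, and comments can also be recorded programmatically, for example to capture end-user feedback. Here is a minimal sketch using the SDK's logFeedback method, assuming a logger initialized for your project and a span ID captured when the original request was logged:
import { initLogger } from "braintrust";

const logger = initLogger({ projectName: "My project" });

// Attach a score and a comment to an existing span.
// `spanId` is assumed to have been captured at logging time.
logger.logFeedback({
  id: spanId,
  scores: { helpfulness: 0.25 },
  comment: "Answer ignored the user's follow-up question.",
});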

Use focused review mode

For reviewing large batches, use Review mode, which is optimized for rapid evaluation. Enter review mode by pressing “r” or selecting the expand icon next to the Human review header. Review mode features:
  • Set scores, comments, and expected values
  • Keyboard navigation for speed
  • Shareable links that open directly in review mode

Review filtered data

Filter logs or experiments using natural language or SQL, then enter review mode to evaluate the matching items. Use tags to mark items for “Triage”, then review them all at once. Save filters, sorts, and column configurations as views for standardized review workflows. Combining views with review mode offers:
  • Fast evaluation: Intuitive filters, reusable configurations, and keyboard navigation enable fast and efficient evaluation.
  • Dynamic and flexible views: Views update automatically as new rows match their saved criteria, without requiring complex automation rules.
  • Easy collaboration: Share review mode links for team collaboration without intricate permissions or setup overhead.

Create review queues

The Review list is a centralized queue showing all spans marked for review across your project. This complements focused reviews by giving you a curated queue of items that need attention, regardless of where they appear in your project. To mark spans for review:
  1. Select Flag for review in the span header.
  2. Bulk select rows and flag them together.
  3. Optionally assign to specific users.
Navigate to Review in the sidebar to see all flagged spans.

Review in context

When you open a span in the list, you’ll see it in the context of its full trace. This allows you to understand the span’s role within the larger request and review parent and child spans for additional context. Mark spans as Complete when finished or navigate to the next item in the queue.

Filter by scores

Find logs with specific scores using the filter menu or API:
// Fetch log events whose Preference score is above 0.75.
// `braintrust` is an initialized Braintrust API client and `projectId` is the target project's ID.
const results = await braintrust.projects.logs.fetch(projectId, {
  query: "scores.Preference > 0.75"
});
Use this to add highly-rated examples to datasets or investigate low-scoring patterns.
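For example, here is a rough sketch of promoting those high-scoring rows into a dataset with initDataset. The events property on the fetch response and the field names below are assumptions; adjust them to your data:
import { initDataset } from "braintrust";

// Sketch: turn highly rated logs into dataset rows for future evals.
// Assumes `results` is the fetch response above and exposes an `events` array.
const dataset = initDataset("My project", { dataset: "Preferred answers" });

for (const event of results.events) {
  dataset.insert({
    input: event.input,
    expected: event.output, // treat the well-scored output as ground truth
    metadata: { source: "human-review" },
  });
}

// Make sure pending inserts are written before the process exits.
await dataset.flush();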

Next steps