Human review

Human review is a critical part of the evaluation process.

Although Braintrust helps you automatically evaluate AI software, human review is a critical part of the process. Braintrust seamlessly integrates human feedback from end users, subject matter experts, and product teams in one place. You can use human review to evaluate and compare experiments, assess the efficacy of your automated scoring methods, and curate log events to use in your evals. As you add human review scores, your logs will update in real time.

Human review label

Configure human review

To set up human review, define the scores you want to collect in your project's Configuration tab.

Human Review Configuration

Select Add human review score to configure a new score. A score can be one of:

  • Continuous number value between 0% and 100%, with a slider input control.
  • Categorical value where you define the possible options and their scores. Each option is assigned a unique percentage value between 0% and 100% (stored as 0 to 1).
  • Free-form text where you can write a string value to the metadata field at a specified path.

Create modal

Created human review scores will appear in the Human review section in every experiment and log trace in the project. Categorical scores configured to "write to expected" and free-form scores will also appear on dataset rows.

Write to expected fields

You may choose to write categorical scores to the expected field of a span instead of a score. To enable this, check the Write to expected field instead of score option. There is also an option to Allow multiple choice when writing to the expected field.

A numeric score will not be assigned to the categorical options when writing to the expected field. If there is an existing object in the expected field, the categorical value will be appended to the object.

Write to expected

In addition to categorical scores, you can always directly edit the structured output for the expected field of any span through the UI.

Review logs and experiments

To manually review results from your logs or experiments, select a row to open the trace view. There, you can edit the human review scores you previously configured.

As you set scores, they will be automatically saved and reflected in the summary metrics. The process is the same whether you're reviewing logs or experiments.

Leave comments

In addition to setting scores, you can also add comments to spans and update their expected values. These updates are tracked alongside score updates to form an audit trail of edits to a span.

If you leave a comment that you want to share with a teammate, you can copy a link that will deeplink to the comment.

Focused review mode

If you or a subject matter expert is reviewing a large number of logs or experiments, you can use Review mode to enter a UI that's optimized specifically for review. To enter review mode, press the r key or select the expand icon next to the Human review header in a span.

In review mode, you can set scores, leave comments, and edit expected values. Review mode is optimized for keyboard navigation, so you can quickly move between scores and rows with keyboard shortcuts. You can also share a link to the review mode view with other team members, and they'll drop directly into review mode.

Review data that matches specific criteria

To easily review a subset of your logs or experiments that match given criteria, you can filter using English or BTQL, then enter review mode.

In addition to filters, you can use tags to mark items for Triage, and then review them all at once.

You can also save any filters, sorts, or column configurations as views. Views give you a standardized place to see any current or future logs that match given criteria, for example, logs with a Factuality score below 50%. Once you create your view, you can enter review mode right from there.

Because reviewing is a common task, you can enter review mode from any experiment or log view, and you can re-enter it from any view to audit past reviews or update scores.

Dynamic review with views

  • Designed for optimal productivity: The combination of views and human review mode simplifies the review process with intuitive filters, reusable configurations, and keyboard navigation, enabling fast and efficient evaluation and feedback.

  • Dynamic and flexible views: Views dynamically update with new rows matching saved criteria, without the need to set up and maintain complex automation rules.

  • Easy collaboration: Sharing review mode links allows for team collaboration without requiring intricate permissions or setup overhead.

Select spans for review

The Review list is a centralized annotation queue to see all spans that have been marked for review across your project. This complements focused reviews by giving you a curated queue of items that need attention, regardless of where they appear in your project.

To mark a span for review, select Flag for review in the span header. You can also bulk select rows that need review and select Flag for review. Additionally, you can assign spans to specific users so they can view all spans pending their review.

Navigate to Review from the sidebar to see all marked spans across your project.

Review in context

When you open a span in the list, you'll see it in the context of its full trace. This allows you to understand the span's role within the larger request and review parent and child spans for additional context.

Once you've finished reviewing a span, you can mark it as Complete or navigate to the next item in the queue.

Filter using feedback

In the UI, you can filter log events by score using the filter button, for example "Preference is greater than 75%", and then add the matching rows to a dataset for further investigation.

You can also filter log events programmatically through the API by passing a query and the project ID:

await braintrust.projects.logs.fetch(projectId, { query });
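
For example, a minimal sketch of this call might look like the following, assuming the SDK call shown above and an illustrative BTQL-style filter. The project ID, score name, and query syntax are placeholders to adapt to your project.

import * as braintrust from "braintrust"; // assumes the SDK used by the snippet above

// Illustrative values: substitute your own project ID and configured score name.
const projectId = "YOUR_PROJECT_ID";
// Assumed BTQL-style filter expression; see the BTQL reference for exact syntax.
const query = "filter: scores.Preference > 0.75";

const matching = await braintrust.projects.logs.fetch(projectId, { query });
console.log(matching);

From there, you could add the matching rows to a dataset for further investigation, as described above.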

This is a powerful way to utilize human feedback to improve your evals.

Capture end-user feedback

The same set of updates — scores, comments, and expected values — can be captured from end-users as well. See the Logging guide for more details.
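
For example, a minimal sketch of recording end-user feedback from application code might look like this, assuming the initLogger and logFeedback calls described in the Logging guide. The project name, span ID, and score name below are placeholders.

import { initLogger } from "braintrust";

// Placeholders: use your own project name and the ID of the span the user is reacting to.
const logger = initLogger({ projectName: "YOUR_PROJECT" });
const spanId = "SPAN_ID_FROM_YOUR_LOGS";

// Attach a score, a comment, and metadata to an existing span.
logger.logFeedback({
  id: spanId,
  scores: { preference: 1 }, // maps to a human review score configured in the project
  comment: "User marked this response as helpful",
  metadata: { user_id: "user-123" },
});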
