User feedback from production provides invaluable signal for building evaluation datasets and identifying areas for improvement. Capture feedback systematically to create high-quality test cases that reflect real user needs.

Why capture feedback

Production feedback helps you:
  • Build datasets from actual user interactions
  • Identify edge cases and failure modes
  • Understand user preferences and expectations
  • Validate that improvements work for real users
Feedback captured in production flows directly into your annotation workflow, making it easy to curate datasets and iterate.

Types of feedback

Braintrust supports multiple feedback types that you can combine:
  • Scores: Thumbs up/down, star ratings, or custom numeric values
  • Expected values: User corrections showing what the output should be
  • Comments: Free-form explanations or context
  • Metadata: Structured data like user ID, session ID, or feature flags
See Capture user feedback in the Instrument section for implementation details.
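
All of these feedback types can be attached to the same trace after the fact. The following is a minimal sketch, assuming the Braintrust Python SDK's log_feedback helper accepts id, scores, expected, comment, and metadata as described in the Instrument section; the project name, span ID, and score names are illustrative.

import braintrust

# Sketch: attach several kinds of feedback to a previously logged span.
logger = braintrust.init_logger(project="my-assistant")  # placeholder project name

def record_feedback(span_id: str, thumbs_up: bool, correction: str | None = None, note: str | None = None):
    logger.log_feedback(
        id=span_id,                                      # ID of the span the user reacted to
        scores={"user_rating": 1 if thumbs_up else 0},   # numeric score (thumbs up/down)
        expected=correction,                             # user correction, if any
        comment=note,                                    # free-form explanation
        metadata={"session_id": "sess_123"},             # structured context for later filtering
    )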

Build datasets from feedback

Once you’ve captured feedback, use it to create evaluation datasets:

Filter by feedback scores

Use the filter menu to find traces with specific feedback:
WHERE scores.user_rating > 0.8
WHERE metadata.thumbs_up = true
WHERE comment IS NOT NULL AND scores.correctness < 0.5
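
If you export logs and filter in code rather than in the UI, the same conditions translate to ordinary predicates. A sketch, under the assumption that each exported record is a dict with scores, metadata, and comment fields (field names illustrative):

# Sketch: the filters above, applied client-side to exported log records.
def highly_rated(record: dict) -> bool:
    return record.get("scores", {}).get("user_rating", 0) > 0.8

def thumbs_up(record: dict) -> bool:
    return record.get("metadata", {}).get("thumbs_up") is True

def commented_and_incorrect(record: dict) -> bool:
    return record.get("comment") is not None and record.get("scores", {}).get("correctness", 1) < 0.5

def select_candidates(records: list[dict]) -> list[dict]:
    return [r for r in records if highly_rated(r)]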

Copy to datasets

After filtering:
  1. Select the traces you want to include.
  2. Select Add to dataset.
  3. Choose an existing dataset or create a new one.
This workflow lets you build “golden datasets” from highly rated examples or create test suites from problematic cases.
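
The same curation can be scripted. A sketch, assuming the Python SDK's init_dataset and dataset.insert and the record shape from the filtering sketch above; the project and dataset names are placeholders:

import braintrust

def build_golden_dataset(records: list[dict]) -> None:
    # Create (or append to) a dataset in the same project.
    dataset = braintrust.init_dataset(project="my-assistant", name="golden-from-feedback")
    for record in records:
        # Keep only highly rated traces as golden examples.
        if record.get("scores", {}).get("user_rating", 0) > 0.8:
            dataset.insert(
                input=record["input"],
                # Prefer the user's correction when one was provided.
                expected=record.get("expected") or record["output"],
                metadata={"source": "production-feedback"},
            )
    dataset.flush()  # assumption: flush sends any buffered rows before the script exits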

Generate with Loop

Ask Loop to create datasets based on feedback patterns. Example queries:
  • “Create a dataset from logs with positive feedback”
  • “Generate a dataset from user corrections”
  • “Build a dataset from cases where users clicked thumbs down”

Use feedback for human review

Production feedback complements internal review:

Configure review scores

Set up review scores that match your production feedback. For example, if you capture thumbs up/down in production, configure a matching categorical score for internal review. This consistency lets you compare user feedback with expert assessments.

Review low-scoring traces

Filter for traces with poor user feedback and enter review mode:
  1. Apply a filter such as WHERE scores.user_rating < 0.3 (SQL) or filter: scores.user_rating < 0.3 (BTQL).
  2. Enter Review mode.
  3. Add internal scores and comments.
  4. Update expected values.
This helps you understand why users were dissatisfied and what the correct output should be.

Track feedback patterns

Use dashboards to monitor feedback trends:
  • User satisfaction over time
  • Feedback distribution by feature or user segment
  • Correlation between automated scores and user feedback
  • Common feedback themes (via comments)
See Monitor with dashboards for details on creating custom charts.

Iterate on improvements

Close the feedback loop:
  1. Capture feedback from production users.
  2. Annotate traces with expected values and labels.
  3. Build datasets from annotated examples.
  4. Evaluate changes using those datasets.
  5. Deploy improvements and monitor feedback again.
This cycle ensures you’re optimizing for what users actually care about.
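
Step 4 is where the curated data pays off: point an eval at the dataset you built from feedback. A sketch, assuming the Python SDK's Eval entry point and the autoevals Levenshtein scorer; the project, dataset, and task are placeholders.

from braintrust import Eval, init_dataset
from autoevals import Levenshtein

def my_task(input):
    # Placeholder for the application code under test.
    return f"answer for {input}"

Eval(
    "my-assistant",  # placeholder project name
    data=init_dataset(project="my-assistant", name="golden-from-feedback"),
    task=my_task,
    scores=[Levenshtein],  # compares output against the user-corrected expected value
)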

Next steps