User feedback from production provides invaluable signal for building evaluation datasets and identifying areas for improvement. Capture feedback systematically to create high-quality test cases that reflect real user needs.

Why capture feedback

Production feedback helps you:
  • Build datasets from actual user interactions
  • Identify edge cases and failure modes
  • Understand user preferences and expectations
  • Validate that improvements work for real users
Feedback captured in production flows directly into your annotation workflow, making it easy to curate datasets and iterate.

Types of feedback

Braintrust supports multiple feedback types that you can combine:
  • Scores: Thumbs up/down, star ratings, or custom numeric values
  • Expected values: User corrections showing what the output should be
  • Comments: Free-form explanations or context
  • Metadata: Structured data like user ID, session ID, or feature flags
See Capture user feedback in the Instrument section for implementation details.
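
All of these feedback types can be attached to the same trace after the fact. The following is a minimal sketch, assuming the Braintrust Python SDK's log_feedback helper accepts id, scores, expected, comment, and metadata as described in the Instrument section; the project name, span ID, and score names are illustrative.

import braintrust

# Sketch: attach several kinds of feedback to a previously logged span.
logger = braintrust.init_logger(project="my-assistant")  # placeholder project name

def record_feedback(span_id: str, thumbs_up: bool, correction: str | None = None, note: str | None = None):
    logger.log_feedback(
        id=span_id,                                      # ID of the span the user reacted to
        scores={"user_rating": 1 if thumbs_up else 0},   # numeric score (thumbs up/down)
        expected=correction,                             # user correction, if any
        comment=note,                                    # free-form explanation
        metadata={"session_id": "sess_123"},             # structured context for later filtering
    )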

Build datasets from feedback

Once you’ve captured feedback, use it to create evaluation datasets:

Filter by feedback scores

Use the filter menu to find traces with specific feedback:
WHERE scores.user_rating > 0.8
WHERE metadata.thumbs_up = true
WHERE comment IS NOT NULL AND scores.correctness < 0.5
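
If you export logs and filter in code rather than in the UI, the same conditions translate to ordinary predicates. A sketch, under the assumption that each exported record is a dict with scores, metadata, and comment fields (field names illustrative):

# Sketch: the filters above, applied client-side to exported log records.
def highly_rated(record: dict) -> bool:
    return record.get("scores", {}).get("user_rating", 0) > 0.8

def thumbs_up(record: dict) -> bool:
    return record.get("metadata", {}).get("thumbs_up") is True

def commented_and_incorrect(record: dict) -> bool:
    return record.get("comment") is not None and record.get("scores", {}).get("correctness", 1) < 0.5

def select_candidates(records: list[dict]) -> list[dict]:
    return [r for r in records if highly_rated(r)]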

Copy to datasets

After filtering:
  1. Select the traces you want to include.
  2. Select Add to dataset.
  3. Choose an existing dataset or create a new one.
This workflow lets you build “golden datasets” from highly rated examples or create test suites from problematic cases.
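
The same curation can be scripted. A sketch, assuming the Python SDK's init_dataset and dataset.insert and the record shape from the filtering sketch above; the project and dataset names are placeholders:

import braintrust

def build_golden_dataset(records: list[dict]) -> None:
    # Create (or append to) a dataset in the same project.
    dataset = braintrust.init_dataset(project="my-assistant", name="golden-from-feedback")
    for record in records:
        # Keep only highly rated traces as golden examples.
        if record.get("scores", {}).get("user_rating", 0) > 0.8:
            dataset.insert(
                input=record["input"],
                # Prefer the user's correction when one was provided.
                expected=record.get("expected") or record["output"],
                metadata={"source": "production-feedback"},
            )
    dataset.flush()  # assumption: flush sends any buffered rows before the script exits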

Generate with Loop

Ask Loop to create datasets based on feedback patterns. Example queries:
  • “Create a dataset from logs with positive feedback”
  • “Generate a dataset from user corrections”
  • “Build a dataset from cases where users clicked thumbs down”

Use feedback for human review

Production feedback complements internal review:

Configure review scores

Set up review scores that match your production feedback. For example, if you capture thumbs up/down in production, configure a matching categorical score for internal review. This consistency lets you compare user feedback with expert assessments.

Review low-scoring traces

Filter for traces with poor user feedback and enter review mode:
  1. Apply a filter such as WHERE scores.user_rating < 0.3 (SQL) or filter: scores.user_rating < 0.3 (BTQL).
  2. Enter Review mode.
  3. Add internal scores and comments.
  4. Update expected values.
This helps you understand why users were dissatisfied and what the correct output should be.

Track feedback patterns

Use dashboards to monitor feedback trends:
  • User satisfaction over time
  • Feedback distribution by feature or user segment
  • Correlation between automated scores and user feedback
  • Common feedback themes (via comments)
See Monitor with dashboards for details on creating custom charts.

Iterate on improvements

Close the feedback loop:
  1. Capture feedback from production users.
  2. Annotate traces with expected values and labels.
  3. Build datasets from annotated examples.
  4. Evaluate changes using those datasets.
  5. Deploy improvements and monitor feedback again.
This cycle ensures you’re optimizing for what users actually care about.
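
Step 4 is where the curated data pays off: point an eval at the dataset you built from feedback. A sketch, assuming the Python SDK's Eval entry point and the autoevals Levenshtein scorer; the project, dataset, and task are placeholders.

from braintrust import Eval, init_dataset
from autoevals import Levenshtein

def my_task(input):
    # Placeholder for the application code under test.
    return f"answer for {input}"

Eval(
    "my-assistant",  # placeholder project name
    data=init_dataset(project="my-assistant", name="golden-from-feedback"),
    task=my_task,
    scores=[Levenshtein],  # compares output against the user-corrected expected value
)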

Next steps