Summary

Goal: Calculate weighted averages for grouped experiments based on dataset size instead of simple mean aggregation. Features: SQL/BTQL queries, data export via API, external computation.

Problem

The Aggregate Scores UI calculates grouped experiment averages as a simple mean (average of averages), not weighted by the number of examples in each experiment. This produces inaccurate comparisons when experiment datasets have different sizes.
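As a worked illustration with made-up numbers, suppose a group contains two experiments: one scores 0.90 on 100 examples and the other scores 0.50 on 900 examples. The simple mean treats both experiments equally, while the weighted mean reflects that most examples scored 0.50:

```latex
\[
\text{simple mean} = \frac{0.90 + 0.50}{2} = 0.70,
\qquad
\text{weighted mean} = \frac{0.90 \cdot 100 + 0.50 \cdot 900}{100 + 900} = \frac{540}{1000} = 0.54
\]
```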

Configuration Steps

Option 1: SQL/BTQL Query

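Use SQL to compute the weighted average as SUM(avg_score * count) / SUM(count), grouped by your field (prompt_id in this example). The table and column names below are placeholders; substitute the ones from your project.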
SELECT
  prompt_id,
  SUM(avg_score * count) / SUM(count) AS weighted_avg
FROM experiments
WHERE project_id = 'your_project_id'
GROUP BY prompt_id

Reference: SQL queries - Braintrust
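
To sanity-check the shape of the Option 1 query before running it against real data, one option is to try it on a small in-memory table. The sketch below uses Python's built-in sqlite3 module, not Braintrust itself; the table layout and sample rows are assumptions made up for illustration, and the count column is quoted only to avoid any ambiguity with the COUNT function.

```python
import sqlite3

# In-memory stand-in for a per-experiment summary table. The table and column
# names mirror the query above; they are assumptions, not a Braintrust schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE experiments ("
    'project_id TEXT, prompt_id TEXT, avg_score REAL, "count" INTEGER)'
)
conn.executemany(
    "INSERT INTO experiments VALUES (?, ?, ?, ?)",
    [
        ("your_project_id", "prompt_a", 0.90, 100),  # small experiment
        ("your_project_id", "prompt_a", 0.50, 900),  # large experiment
        ("your_project_id", "prompt_b", 0.80, 500),
    ],
)

# Same weighted-average expression as the query above.
rows = conn.execute(
    """
    SELECT
      prompt_id,
      SUM(avg_score * "count") / SUM("count") AS weighted_avg
    FROM experiments
    WHERE project_id = 'your_project_id'
    GROUP BY prompt_id
    ORDER BY prompt_id
    """
).fetchall()

for prompt_id, weighted_avg in rows:
    print(f"{prompt_id}: {weighted_avg:.2f}")
```

With these sample rows, prompt_a should come out near 0.54 rather than the simple mean of 0.70, because the 900-example experiment dominates.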

Option 2: Export and Calculate Externally

Export the grouped results as CSV or JSON and calculate the weighted mean externally.
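import pandas as pd

# Load exported experiment data; assumes columns prompt_id, avg_score, count
df = pd.read_csv('experiments.csv')

# Weighted average per group: sum(avg_score * count) / sum(count)
df['weighted_score'] = df['avg_score'] * df['count']
totals = df.groupby('prompt_id')[['weighted_score', 'count']].sum()
weighted_avg = totals['weighted_score'] / totals['count']
print(weighted_avg)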

Reference: Interpret evaluation results - Braintrust

Option 3: Python SDK

Fetch experiment data via the SDK and compute weighted averages programmatically.
from collections import defaultdict

import braintrust

# The accessor names below (load_project, experiments, metadata, scores,
# num_examples) are as given in this article; verify them against your SDK version.
braintrust.login()
experiments = braintrust.load_project('project_name').experiments()

# Accumulate score * count and count per prompt_id
groups = defaultdict(lambda: {'total_weighted': 0.0, 'total_count': 0})

for exp in experiments:
    prompt_id = exp.metadata.get('prompt_id')
    avg_score = exp.scores.get('accuracy')
    count = exp.num_examples

    # Skip experiments with no score or no examples so the math stays valid
    if avg_score is None or not count:
        continue

    groups[prompt_id]['total_weighted'] += avg_score * count
    groups[prompt_id]['total_count'] += count

# Every group here has total_count > 0, so the division is safe
for prompt_id, data in groups.items():
    weighted_avg = data['total_weighted'] / data['total_count']
    print(f"{prompt_id}: {weighted_avg}")

When to Use Weighted vs Simple Averages

  • Weighted average: Use when experiment datasets have different sizes and you need accurate overall performance metrics for prompt comparison
  • Simple average: Use when all experiments have similar dataset sizes or when each experiment result should have equal influence regardless of size
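
The distinction above can be checked numerically. The following is a minimal sketch in plain Python with made-up scores and counts; the helper names simple_mean and weighted_mean are hypothetical and not part of any SDK. It shows that the two averages agree when experiments have equal sizes and diverge when they do not.

```python
def simple_mean(scores):
    """Average of per-experiment averages; each experiment counts equally."""
    return sum(scores) / len(scores)


def weighted_mean(scores, counts):
    """Average weighted by the number of examples in each experiment."""
    total = sum(counts)
    if total == 0:
        raise ValueError("at least one experiment must contain examples")
    return sum(s * n for s, n in zip(scores, counts)) / total


scores = [0.90, 0.50]  # illustrative per-experiment average scores

equal_sizes = weighted_mean(scores, [500, 500])    # same size: matches simple mean
unequal_sizes = weighted_mean(scores, [100, 900])  # larger experiment dominates

print(f"simple mean:             {simple_mean(scores):.2f}")
print(f"weighted, equal sizes:   {equal_sizes:.2f}")
print(f"weighted, unequal sizes: {unequal_sizes:.2f}")
```

When the sizes are equal the weighting has no effect, which is why the simple average is a reasonable choice in that case.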