Understanding Experiment Score Aggregation: Simple vs

Summary

Goal: Calculate weighted averages for grouped experiments based on dataset size instead of simple mean aggregation. Features: SQL/BTQL queries, data export via API, external computation.

Problem

The Aggregate Scores UI calculates grouped experiment averages as a simple mean (average of averages), not weighted by the number of examples in each experiment. This produces inaccurate comparisons when experiment datasets have different sizes.
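
A small worked example shows how far the two calculations can drift apart. The numbers are made up for illustration: one group containing a 100-example experiment and a 900-example experiment.

```python
# Hypothetical per-experiment results for one group:
# (average score, number of examples)
experiments = [(0.90, 100), (0.50, 900)]

# Simple mean: every experiment counts equally, regardless of size
simple_mean = sum(score for score, _ in experiments) / len(experiments)

# Weighted mean: every example counts equally
weighted_mean = (
    sum(score * n for score, n in experiments)
    / sum(n for _, n in experiments)
)

print(f"simple: {simple_mean:.2f}, weighted: {weighted_mean:.2f}")
# prints "simple: 0.70, weighted: 0.54"
```

The simple mean overstates the group's performance here because the small, high-scoring experiment has the same influence as the one nine times its size.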

Configuration Steps

Option 1: SQL/BTQL Query

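Use SQL to compute the weighted average, sum(avg_score * count) / sum(count), grouped by your field. The table and column names in the query (experiments, avg_score, count, prompt_id) are placeholders; substitute the names your data actually uses.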

SELECT
  prompt_id,
  SUM(avg_score * count) / SUM(count) AS weighted_avg
FROM experiments
WHERE project_id = 'your_project_id'
GROUP BY prompt_id
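
One way to sanity-check the query logic before running it against real data is to execute it on a small in-memory SQLite table. This is only a sketch: the table layout, column names, and rows are made up to match the placeholders in the query above, and SQLite stands in for whatever SQL engine you actually use, so dialect details may differ.

```python
import sqlite3

# In-memory database with a hypothetical table shaped like the query expects
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE experiments "
    "(project_id TEXT, prompt_id TEXT, avg_score REAL, count INTEGER)"
)
conn.executemany(
    "INSERT INTO experiments VALUES (?, ?, ?, ?)",
    [
        ("your_project_id", "prompt_a", 0.90, 100),
        ("your_project_id", "prompt_a", 0.50, 900),
        ("your_project_id", "prompt_b", 0.80, 200),
        ("other_project", "prompt_a", 0.10, 1000),  # filtered out by WHERE
    ],
)

query = """
SELECT
  prompt_id,
  SUM(avg_score * count) / SUM(count) AS weighted_avg
FROM experiments
WHERE project_id = ?
GROUP BY prompt_id
ORDER BY prompt_id
"""

# Map each prompt_id to its weighted average
weighted = dict(conn.execute(query, ("your_project_id",)).fetchall())
for prompt_id, value in weighted.items():
    print(f"{prompt_id}: {value:.2f}")
```

With these rows, prompt_a comes out at 0.54 rather than the 0.70 a simple mean of its two experiments would give.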

Reference: SQL queries - Braintrust

Option 2: Export and Calculate Externally

Export the grouped results as CSV or JSON and calculate the weighted mean externally.

import pandas as pd

# Load exported experiment data
df = pd.read_csv('experiments.csv')

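# Calculate the weighted average per group. Selecting the two columns
# explicitly keeps the grouping column out of apply(), which recent
# pandas versions warn about.
weighted_avg = (
    df.groupby('prompt_id')[['avg_score', 'count']]
    .apply(lambda g: (g['avg_score'] * g['count']).sum() / g['count'].sum())
)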
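
If pandas is not available, the same calculation can be sketched with the standard library alone. The column names (prompt_id, avg_score, count) are the ones assumed in the pandas snippet, and the sample rows are made up; an actual export may use different headers.

```python
import csv
import io
from collections import defaultdict

# Made-up stand-in for an exported file. With a real export, replace this
# with: open("experiments.csv", newline="")
sample_export = io.StringIO(
    "prompt_id,avg_score,count\n"
    "prompt_a,0.90,100\n"
    "prompt_a,0.50,900\n"
    "prompt_b,0.80,200\n"
)

weighted_sum = defaultdict(float)  # sum of avg_score * count per group
total_count = defaultdict(int)     # sum of count per group

for row in csv.DictReader(sample_export):
    n = int(row["count"])
    weighted_sum[row["prompt_id"]] += float(row["avg_score"]) * n
    total_count[row["prompt_id"]] += n

# Divide once per group, skipping groups with no examples
weighted_avg = {
    pid: weighted_sum[pid] / total_count[pid]
    for pid in weighted_sum
    if total_count[pid] > 0
}

for pid, value in weighted_avg.items():
    print(f"{pid}: {value:.2f}")
```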

Reference: Interpret evaluation results - Braintrust

Option 3: Python SDK

Fetch experiment data via the SDK and compute weighted averages programmatically.

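from collections import defaultdict

import braintrust

braintrust.login()

# Illustrative accessors: confirm load_project(...).experiments() and the
# attribute names read below (metadata, scores, num_examples) against the
# SDK version you have installed.
experiments = braintrust.load_project('project_name').experiments()

# Accumulate the weighted score sum and the example count per prompt_id
groups = defaultdict(lambda: {'total_weighted': 0.0, 'total_count': 0})

for exp in experiments:
    prompt_id = exp.metadata.get('prompt_id')
    avg_score = exp.scores.get('accuracy')
    count = exp.num_examples

    # Skip experiments with no score or no examples; they would otherwise
    # raise a TypeError or contribute nothing to the average
    if avg_score is None or not count:
        continue

    groups[prompt_id]['total_weighted'] += avg_score * count
    groups[prompt_id]['total_count'] += count

for prompt_id, data in groups.items():
    weighted_avg = data['total_weighted'] / data['total_count']
    print(f"{prompt_id}: {weighted_avg}")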

When to Use Weighted vs Simple Averages

Weighted average: Use when experiment datasets have different sizes and you need accurate overall performance metrics for prompt comparison.
Simple average: Use when all experiments have similar dataset sizes or when each experiment result should have equal influence regardless of size.
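
The relationship between the two averages can be written out explicitly. The notation is introduced here for illustration: a group contains k experiments, and experiment i has average score s_i over n_i examples.

```latex
\bar{s}_{\text{weighted}} = \frac{\sum_{i=1}^{k} n_i \, s_i}{\sum_{i=1}^{k} n_i},
\qquad
\bar{s}_{\text{simple}} = \frac{1}{k} \sum_{i=1}^{k} s_i

% If every experiment has the same size, n_i = n for all i:
\bar{s}_{\text{weighted}}
  = \frac{n \sum_{i=1}^{k} s_i}{k \, n}
  = \frac{1}{k} \sum_{i=1}^{k} s_i
  = \bar{s}_{\text{simple}}
```

When dataset sizes are equal the two averages coincide, so the choice only matters once sizes differ.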
