# Understanding Experiment Score Aggregation: Simple vs Weighted Averages

export const plans_0 = "Any"

export const deployments_0 = "Any"

export const data_plane_version_0 = undefined

export const use_case_0 = undefined

<Note>
  **Applies to:**

  * Plan - {plans_0}
  * Deployment - {deployments_0}
  * {data_plane_version_0}
  * {use_case_0}
</Note>

## Summary

**Goal:** Calculate averages for grouped experiments that are weighted by dataset size, rather than a simple mean of per-experiment averages.

**Features:** SQL/BTQL queries, data export via API, external computation.

## Problem

The Aggregate Scores UI calculates grouped experiment averages as a simple mean (an average of averages), not weighted by the number of examples in each experiment. When the experiments in a group have different dataset sizes, this skews comparisons, because a small experiment influences the result as much as a large one.
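
To make the gap concrete, here is a minimal sketch with two made-up experiments in the same group. The scores and example counts are hypothetical and chosen only to illustrate the arithmetic.

```python
# Hypothetical per-experiment summaries: (average score, number of examples)
experiments = [
    (1.0, 10),  # small experiment, high score
    (0.5, 90),  # large experiment, lower score
]

# Simple mean: every experiment counts equally
simple_mean = sum(score for score, _ in experiments) / len(experiments)

# Weighted mean: every example counts equally
total_examples = sum(n for _, n in experiments)
weighted_mean = sum(score * n for score, n in experiments) / total_examples

print(simple_mean)    # 0.75
print(weighted_mean)  # 0.55
```

The simple mean reports 0.75 even though 90 of the 100 examples belong to the experiment that scored 0.5.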

## Configuration Steps

### Option 1: SQL/BTQL Query

Use SQL to compute the weighted average, `sum(avg_score * count) / sum(count)`, grouped by the field you group experiments on.

```sql theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
SELECT
  prompt_id,
  SUM(avg_score * count) / SUM(count) AS weighted_avg
FROM experiments
WHERE project_id = 'your_project_id'
GROUP BY prompt_id
```

Reference: [SQL queries - Braintrust](/reference/sql)

### Option 2: Export and Calculate Externally

Export the grouped results as CSV or JSON and calculate the weighted mean externally.

```python theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
import pandas as pd

# Load exported experiment data
df = pd.read_csv('experiments.csv')

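# Calculate the weighted average per group; selecting the two columns
# keeps the grouping key out of the frames passed to apply()
weighted_avg = (
    df.groupby('prompt_id')[['avg_score', 'count']]
    .apply(lambda g: (g['avg_score'] * g['count']).sum() / g['count'].sum())
)
print(weighted_avg)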
```

Reference: [Interpret evaluation results - Braintrust](/evaluate/interpret-results)
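
If pandas is not available, the same calculation can be sketched with the standard library alone. This assumes the export has `prompt_id`, `avg_score`, and `count` columns, as in the example above; the inline `EXPORT` string and the `weighted_averages` helper are hypothetical stand-ins for illustration.

```python
import csv
import io
from collections import defaultdict

# Stand-in for an exported file. For a real export, pass
# open("experiments.csv", newline="") to csv.DictReader instead.
EXPORT = """prompt_id,avg_score,count
p1,1.0,10
p1,0.5,90
p2,0.8,50
"""


def weighted_averages(rows):
    """Return {prompt_id: mean of avg_score weighted by count}."""
    weighted_sum = defaultdict(float)
    total_count = defaultdict(int)
    for row in rows:
        count = int(row["count"])
        weighted_sum[row["prompt_id"]] += float(row["avg_score"]) * count
        total_count[row["prompt_id"]] += count
    # Skip groups with no examples to avoid dividing by zero
    return {k: weighted_sum[k] / n for k, n in total_count.items() if n > 0}


result = weighted_averages(csv.DictReader(io.StringIO(EXPORT)))
print(result)
```

Accumulating the weighted sum and the count separately means the file is read in a single pass and no per-group lists are kept in memory.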

### Option 3: Python SDK

Fetch experiment data via the SDK and compute weighted averages programmatically.

```python theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
from collections import defaultdict

import braintrust

braintrust.login()
# NOTE: confirm load_project(), .experiments(), and the exp.* attributes
# used below against your installed SDK version.
experiments = braintrust.load_project('project_name').experiments()

# Accumulate score * count and count for each prompt_id
groups = defaultdict(lambda: {'total_weighted': 0.0, 'total_count': 0})

for exp in experiments:
    prompt_id = exp.metadata.get('prompt_id')
    avg_score = exp.scores.get('accuracy')
    count = exp.num_examples

    # Skip experiments with no accuracy score or no examples
    if avg_score is None or not count:
        continue

    groups[prompt_id]['total_weighted'] += avg_score * count
    groups[prompt_id]['total_count'] += count

for prompt_id, data in groups.items():
    weighted_avg = data['total_weighted'] / data['total_count']
    print(f"{prompt_id}: {weighted_avg}")
```

## When to Use Weighted vs Simple Averages

* **Weighted average:** Use when the experiments in a group have different dataset sizes and the overall score should reflect every example equally, for instance when comparing prompts.
* **Simple average:** Use when all experiments have similar dataset sizes, or when each experiment should have equal influence on the result regardless of its size.
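
One quick way to decide is to compute both values for a group and see whether they differ. The helper below is a hypothetical sketch, not part of any SDK, and the input pairs are made up.

```python
def simple_and_weighted(pairs):
    """pairs: list of (avg_score, count). Returns (simple_mean, weighted_mean)."""
    simple = sum(score for score, _ in pairs) / len(pairs)
    weighted = sum(score * n for score, n in pairs) / sum(n for _, n in pairs)
    return simple, weighted


# Equal dataset sizes: the two averages agree
print(simple_and_weighted([(1.0, 50), (0.5, 50)]))  # (0.75, 0.75)

# Unequal dataset sizes: the two averages diverge
print(simple_and_weighted([(1.0, 25), (0.5, 75)]))  # (0.75, 0.625)
```

When the two numbers match closely, the simple average shown in the UI is already a fair summary for that group.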
