Summary
Goal: Calculate weighted averages for grouped experiments based on dataset size instead of simple mean aggregation.
Features: SQL/BTQL queries, data export via API, external computation.
Problem
The Aggregate Scores UI calculates grouped experiment averages as a simple mean (average of averages), not weighted by the number of examples in each experiment. This produces inaccurate comparisons when experiment datasets have different sizes.
Configuration Steps
Option 1: SQL/BTQL Query
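Use SQL to compute the weighted average as sum(avg_score * count) / sum(count), grouped by your field. Here avg_score is each experiment's average score and count is the number of examples in that experiment.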
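The shape of that query can be sketched with Python's built-in sqlite3 module. This is only an illustration of the formula against an in-memory table, not the product's own query syntax: the table name experiments, the columns group_field and example_count (standing in for count), and the sample rows are all hypothetical, and BTQL syntax may differ.

```python
import sqlite3

# Hypothetical per-experiment summary rows: (group_field, avg_score, example_count).
rows = [
    ("prompt_a", 0.90, 10),
    ("prompt_a", 0.50, 90),
    ("prompt_b", 0.70, 50),
    ("prompt_b", 0.80, 50),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE experiments "
    "(group_field TEXT, avg_score REAL, example_count INTEGER)"
)
conn.executemany("INSERT INTO experiments VALUES (?, ?, ?)", rows)

# Weighted average per group: sum(avg_score * count) / sum(count).
query = """
SELECT
    group_field,
    SUM(avg_score * example_count) / SUM(example_count) AS weighted_avg
FROM experiments
GROUP BY group_field
ORDER BY group_field
"""
results = conn.execute(query).fetchall()

for group_field, weighted_avg in results:
    print(f"{group_field}: weighted_avg={weighted_avg:.3f}")
```

Because avg_score is stored as a real number, the division is done in floating point. If the scores were integers, the numerator would need an explicit cast to avoid integer division.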
Option 2: Export and Calculate Externally
Export grouped results as CSV/JSON and calculate the weighted mean externally.
Option 3: Python SDK
Fetch experiment data via the SDK and compute weighted averages programmatically.
When to Use Weighted vs Simple Averages
- Weighted average: Use when experiment datasets have different sizes and you need accurate overall performance metrics for prompt comparison
- Simple average: Use when all experiments have similar dataset sizes or when each experiment result should have equal influence regardless of size
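To see how far the two averages can diverge, here is a minimal sketch of the external calculation using only the Python standard library. It assumes the export is a CSV with one row per experiment and hypothetical columns named group, avg_score, and count. The inline string stands in for a real exported file, and the numbers are made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Stand-in for an exported file; column names and values are hypothetical.
exported_csv = """group,avg_score,count
prompt_a,0.90,10
prompt_a,0.50,90
prompt_b,0.70,50
prompt_b,0.80,50
"""


def summarize(csv_text):
    """Return {group: (simple_mean, weighted_mean)} from per-experiment rows."""
    score_sum = defaultdict(float)     # sum of avg_score per group
    weighted_sum = defaultdict(float)  # sum of avg_score * count per group
    total_count = defaultdict(int)     # total examples per group
    n_experiments = defaultdict(int)   # number of experiments per group

    for row in csv.DictReader(io.StringIO(csv_text)):
        group = row["group"]
        score = float(row["avg_score"])
        count = int(row["count"])
        score_sum[group] += score
        weighted_sum[group] += score * count
        total_count[group] += count
        n_experiments[group] += 1

    # Skip groups with no examples to avoid dividing by zero.
    return {
        group: (
            score_sum[group] / n_experiments[group],
            weighted_sum[group] / total_count[group],
        )
        for group in score_sum
        if total_count[group] > 0
    }


summary = summarize(exported_csv)
for group, (simple_mean, weighted_mean) in sorted(summary.items()):
    print(f"{group}: simple={simple_mean:.2f} weighted={weighted_mean:.2f}")
```

In this made-up data, prompt_a has one 10-example experiment and one 90-example experiment, so its simple mean (0.70) overstates the weighted mean (0.54). prompt_b has equal-sized experiments, so both averages come out the same (0.75).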