
Summary

Issue: Iterating over a Braintrust dataset with for row in dataset fails with a 500 Internal Server Error from the /btql endpoint, raising braintrust.util.AugmentedHTTPError.

Cause: The default batch size of 1000 rows per paginated BTQL request can hit a transient backend timeout (for example, an S3 connectivity issue on a storage node), and the Python SDK does not automatically retry 5xx responses.

Resolution: Retry the evaluation; transient errors typically resolve on their own. If the error recurs, use dataset.fetch(batch_size=500) to reduce the size of each paginated request.

Resolution steps

If the error occurred once

Step 1: Retry the evaluation

Re-run the eval without changes. A transient backend timeout is not caused by your code or dataset size and should not persist.
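For reference, retrying means invoking the same eval again with no modifications. A minimal sketch, assuming the standard Eval entry point; my-project, my-dataset, my_task, and my_scorers are placeholders for your own names:

from braintrust import Eval, init_dataset

# Identical to the original run; nothing needs to change for a retry.
Eval(
    "my-project",
    data=init_dataset(project="my-project", name="my-dataset"),
    task=my_task,        # your existing task function (placeholder)
    scores=my_scorers,   # your existing scorers (placeholder)
)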

If the error recurs intermittently

Step 1: Replace dataset iteration with fetch(batch_size=...)

Replace for row in dataset with an explicit fetch() call using a smaller batch size. Start with 500; reduce to 100-200 if errors continue.
# Instead of: for row in dataset
dataset_rows = list(dataset.fetch(batch_size=500))
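If the rows feed a Braintrust eval, the materialized list can be passed directly as the data argument. A sketch, assuming the standard Eval entry point and that your rows carry the usual input/expected fields; the project, task, and scorer names are placeholders:

from braintrust import Eval

Eval(
    "my-project",        # placeholder project name
    data=dataset_rows,   # the pre-fetched rows from above
    task=my_task,        # your existing task function (placeholder)
    scores=my_scorers,   # your existing scorers (placeholder)
)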

Step 2: Add retry logic for 5xx errors

The SDK does not retry 5xx responses automatically. Wrap the fetch in a retry loop to handle transient failures without aborting the eval.
import time
from braintrust.util import AugmentedHTTPError

def fetch_with_retry(dataset, batch_size=500, retries=3, delay=5):
    # Retry when the SDK raises AugmentedHTTPError (e.g., a transient
    # 500 from /btql); re-raise once all attempts are exhausted.
    for attempt in range(retries):
        try:
            return list(dataset.fetch(batch_size=batch_size))
        except AugmentedHTTPError:
            if attempt < retries - 1:
                time.sleep(delay)
            else:
                raise
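
Then call the helper in place of the direct fetch. The retries and delay values below are illustrative defaults, not SDK settings:

dataset_rows = fetch_with_retry(dataset, batch_size=500, retries=3, delay=5)
for row in dataset_rows:
    ...  # existing per-row logic

If failures cluster, a growing delay (for example, doubling it on each attempt) spreads retries out further; the fixed delay keeps this sketch simple.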