
Summary

Issue: Iterating over a Braintrust dataset with for row in dataset fails with a 500 Internal Server Error from the /btql endpoint, raising braintrust.util.AugmentedHTTPError.

Cause: The default batch size of 1000 rows per paginated BTQL request can hit a transient backend timeout (for example, an S3 connectivity issue on a storage node), and the Python SDK does not automatically retry 5xx responses.

Resolution: Retry the evaluation; transient errors typically resolve on their own. If the error recurs, use dataset.fetch(batch_size=500) to reduce the size of each paginated request.

Resolution steps

If the error occurred once

Step 1: Retry the evaluation

Re-run the eval without changes. A transient backend timeout is not caused by your code or dataset size and should not persist.
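For reference, retrying means invoking the same eval again with no modifications. A minimal sketch, assuming the standard Eval entry point; my-project, my-dataset, my_task, and my_scorers are placeholders for your own names:

from braintrust import Eval, init_dataset

# Identical to the original run; nothing needs to change for a retry.
Eval(
    "my-project",
    data=init_dataset(project="my-project", name="my-dataset"),
    task=my_task,        # your existing task function (placeholder)
    scores=my_scorers,   # your existing scorers (placeholder)
)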

If the error recurs intermittently

Step 1: Replace dataset iteration with fetch(batch_size=...)

Replace for row in dataset with an explicit fetch() call using a smaller batch size. Start with 500; reduce to 100-200 if errors continue.
# Instead of: for row in dataset
dataset_rows = list(dataset.fetch(batch_size=500))
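If the rows feed a Braintrust eval, the materialized list can be passed directly as the data argument. A sketch, assuming the standard Eval entry point and that your rows carry the usual input/expected fields; the project, task, and scorer names are placeholders:

from braintrust import Eval

Eval(
    "my-project",        # placeholder project name
    data=dataset_rows,   # the pre-fetched rows from above
    task=my_task,        # your existing task function (placeholder)
    scores=my_scorers,   # your existing scorers (placeholder)
)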

Step 2: Add retry logic for 5xx errors

The SDK does not retry 5xx responses automatically. Wrap the fetch in a retry loop to handle transient failures without aborting the eval.
import time
from braintrust.util import AugmentedHTTPError

def fetch_with_retry(dataset, batch_size=500, retries=3, delay=5):
    # Retry when the SDK raises AugmentedHTTPError (e.g., a transient
    # 500 from /btql); re-raise once all attempts are exhausted.
    for attempt in range(retries):
        try:
            return list(dataset.fetch(batch_size=batch_size))
        except AugmentedHTTPError:
            if attempt < retries - 1:
                time.sleep(delay)
            else:
                raise
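
Then call the helper in place of the direct fetch. The retries and delay values below are illustrative defaults, not SDK settings:

dataset_rows = fetch_with_retry(dataset, batch_size=500, retries=3, delay=5)
for row in dataset_rows:
    ...  # existing per-row logic

If failures cluster, a growing delay (for example, doubling it on each attempt) spreads retries out further; the fixed delay keeps this sketch simple.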