Troubleshooting 500 errors in self-hosted data planes

Summary

500 errors and slow page loads in self-hosted Braintrust deployments are typically caused by event loop blocking or resource exhaustion in the API handler instances: the Node.js process is overwhelmed and cannot process incoming requests in time. To resolve the issue, enable telemetry, check the event loop delay metric in Datadog, and scale up the API handler instances to distribute load.

Resolution Steps

Step 1: Enable control plane telemetry

Set the CONTROL_PLANE_TELEMETRY environment variable on your API handler to enable detailed metrics collection.

CONTROL_PLANE_TELEMETRY=Status,Metrics,Usage

Deploy this configuration change via your Helm chart or Terraform module.
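
To confirm that the deployed API handler process actually sees the new value, a check along these lines can help. This is a minimal TypeScript sketch, not part of Braintrust: the helper name `missingTelemetryOptions` is hypothetical, and it assumes the value is a comma-separated, case-sensitive list, which the example above suggests but does not guarantee.

```typescript
// Options taken from the example value shown in this step.
const EXPECTED_OPTIONS: string[] = ["Status", "Metrics", "Usage"];

// Hypothetical helper: returns the expected options that are absent from the raw value.
// Assumes a comma-separated, case-sensitive list; surrounding whitespace is ignored.
function missingTelemetryOptions(raw: string | undefined): string[] {
  const present = new Set(
    (raw ?? "")
      .split(",")
      .map((part) => part.trim())
      .filter((part) => part.length > 0),
  );
  return EXPECTED_OPTIONS.filter((option) => !present.has(option));
}
```

Calling `missingTelemetryOptions(process.env.CONTROL_PLANE_TELEMETRY)` from inside the container returns an empty array when all three options are set.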

Step 2: Check event loop delay metrics

Query Datadog for the event loop delay metric to identify blocking in your API handler instances. Event loop delay above 100 ms indicates that the Node.js process is blocked and cannot handle requests efficiently.
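
For intuition about what this metric measures, the sketch below samples event loop delay in a Node.js process with the built-in `perf_hooks` module. It is an illustration only and is not necessarily how the API handler computes the value it reports. The function names and the 20 ms sampling resolution are assumptions; the 100 ms threshold comes from the guidance above.

```typescript
import { monitorEventLoopDelay } from "node:perf_hooks";

// Threshold from this step: delay above 100 ms suggests a blocked event loop.
const BLOCKED_THRESHOLD_MS = 100;

// The histogram reports values in nanoseconds; convert to milliseconds.
function nsToMs(ns: number): number {
  return ns / 1e6;
}

// True when the observed delay exceeds the threshold.
function isBlocked(delayMs: number, thresholdMs: number = BLOCKED_THRESHOLD_MS): boolean {
  return delayMs > thresholdMs;
}

// Samples event loop delay for `durationMs` and resolves with p99 and max in milliseconds.
function sampleEventLoopDelay(durationMs: number): Promise<{ p99Ms: number; maxMs: number }> {
  const histogram = monitorEventLoopDelay({ resolution: 20 }); // sampling interval in ms (assumed)
  histogram.enable();
  return new Promise((resolve) => {
    setTimeout(() => {
      histogram.disable();
      resolve({
        p99Ms: nsToMs(histogram.percentile(99)),
        maxMs: nsToMs(histogram.max),
      });
    }, durationMs);
  });
}
```

A call such as `sampleEventLoopDelay(5000).then((d) => console.log(d, isBlocked(d.p99Ms)))` prints the sampled delay and whether it crosses the threshold.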

Step 3: Scale API handler instances

Increase the number of API handler replicas to distribute load and improve reliability.

apiHandler:
  replicas: 3  # Increase from current value

Apply this change through your Helm values or Terraform configuration.

Step 4: Monitor performance metrics

Track these key metrics in Datadog after scaling:

Event loop delay (should decrease to <100ms)
Memory usage (should remain stable)
Request latency (should improve)
Error rates (500 errors should decrease)
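
The checks above can be expressed as a simple before/after comparison. The following TypeScript sketch is one way to do that by hand with values read off your dashboards. The `MetricSnapshot` field names, the choice of p95 latency, and the 10% memory tolerance are all assumptions made for illustration; only the 100 ms event loop threshold comes from this article.

```typescript
// Hypothetical shape for values read from your dashboards.
interface MetricSnapshot {
  eventLoopDelayMs: number;
  memoryMb: number;
  p95LatencyMs: number;
  errorRate500: number; // fraction of requests that returned 500
}

// Returns a list of concerns; an empty list means every check passed.
function assessScaling(
  before: MetricSnapshot,
  after: MetricSnapshot,
  memoryTolerance: number = 0.1, // assumed: up to 10% growth counts as stable
): string[] {
  const concerns: string[] = [];
  if (after.eventLoopDelayMs >= 100) {
    concerns.push("event loop delay is still at or above 100 ms");
  }
  if (after.memoryMb > before.memoryMb * (1 + memoryTolerance)) {
    concerns.push("memory usage grew beyond the tolerance");
  }
  if (after.p95LatencyMs >= before.p95LatencyMs) {
    concerns.push("request latency did not improve");
  }
  if (after.errorRate500 > 0 && after.errorRate500 >= before.errorRate500) {
    concerns.push("500 error rate did not decrease");
  }
  return concerns;
}
```

Any concern that remains after scaling suggests the bottleneck is not only replica count and is worth investigating further.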

Step 5: Verify resolution

Test Braintrust links and page loads to confirm 500 errors are resolved and performance has improved. Continue monitoring metrics throughout the day to ensure stability.
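
Manual checks can be supplemented with a small probe that requests a page repeatedly and counts server errors. This is a minimal TypeScript sketch assuming Node.js 18 or later (for the global `fetch`); the function names are hypothetical, and you would supply a URL from your own deployment.

```typescript
// Counts responses with a 5xx status code.
function countServerErrors(statuses: number[]): number {
  return statuses.filter((status) => status >= 500 && status <= 599).length;
}

// Requests `url` sequentially `attempts` times and records each status code.
// A network-level failure is recorded as 0 so it stays distinguishable from HTTP errors.
async function probe(url: string, attempts: number): Promise<number[]> {
  const statuses: number[] = [];
  for (let i = 0; i < attempts; i++) {
    try {
      const response = await fetch(url);
      statuses.push(response.status);
    } catch {
      statuses.push(0);
    }
  }
  return statuses;
}
```

Running `probe(yourUrl, 20).then((s) => console.log(countServerErrors(s)))` at intervals through the day gives a rough signal of whether 500 errors have stopped.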
