# Troubleshooting 500 errors in self-hosted data planes

export const plans_0 = "Enterprise"

export const deployments_0 = "Self-hosted"

export const data_plane_version_0 = undefined

export const use_case_0 = "Use case - Self-hosted data plane experiencing performance issues"

<Note>
  **Applies to:**

  * Plan - {plans_0}
  * Deployment - {deployments_0}
  * {use_case_0}
</Note>

## Summary

500 errors and slow page loads in self-hosted Braintrust deployments are typically caused by event loop blocking or resource exhaustion in API handler instances. When the Node.js process is overloaded, it cannot process incoming requests in time, so requests fail or stall. To resolve the issue, enable telemetry, check the event loop delay metric in Datadog, and scale up the number of API handler instances to distribute load.

## Resolution Steps

### Step 1: Enable control plane telemetry

Set the `CONTROL_PLANE_TELEMETRY` environment variable on your API handler to enable detailed metrics collection.

```text theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
CONTROL_PLANE_TELEMETRY=Status,Metrics,Usage
```

Deploy this configuration change via your Helm chart or Terraform module.

### Step 2: Check event loop delay metrics

Query Datadog for the `event loop delay` metric to identify blocking operations in your API handler instances.

An event loop delay above 100 ms indicates that the Node.js process is blocked and cannot handle requests efficiently.
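To make the metric concrete, here is a minimal TypeScript sketch of one common way to approximate event loop delay in a Node.js process: schedule a timer and measure how late it fires. It is an illustration only and is not how the Braintrust API handler collects its telemetry. The names `sampleEventLoopDelay` and `isBlocked` and the 50 ms sampling interval are hypothetical; the 100 ms threshold comes from the guidance above.

```typescript
// Illustrative sketch: approximates event loop delay via timer drift.
// Not Braintrust's telemetry implementation; all names here are hypothetical.

const BLOCKED_THRESHOLD_MS = 100; // threshold taken from the guidance above

// Resolves with how late (in ms) a timer fired relative to its schedule.
// A busy event loop cannot run the callback on time, so the lateness grows.
function sampleEventLoopDelay(intervalMs: number = 50): Promise<number> {
  const scheduledAt = Date.now();
  return new Promise<number>((resolve) => {
    setTimeout(() => {
      const lateness = Date.now() - scheduledAt - intervalMs;
      resolve(Math.max(0, lateness));
    }, intervalMs);
  });
}

// Returns true when any sampled delay exceeds the threshold.
function isBlocked(
  delaysMs: number[],
  thresholdMs: number = BLOCKED_THRESHOLD_MS
): boolean {
  return delaysMs.some((delay) => delay > thresholdMs);
}
```

Collecting several samples with `sampleEventLoopDelay` and passing them to `isBlocked` shows whether the process is falling behind. Long synchronous work on the main thread is what makes the measured lateness grow.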

### Step 3: Scale API handler instances

Increase the number of API handler replicas to distribute load and improve reliability.

```text theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
apiHandler:
  replicas: 3  # Increase from current value
```

Apply this change through your Helm values or Terraform configuration.

### Step 4: Monitor performance metrics

Track these key metrics in Datadog after scaling:

* Event loop delay (should decrease to below 100 ms)
* Memory usage (should remain stable)
* Request latency (should improve)
* Error rates (500 errors should decrease)

### Step 5: Verify resolution

Test Braintrust links and page loads to confirm 500 errors are resolved and performance has improved.

Continue monitoring metrics throughout the day to ensure stability.
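
The manual check above can be complemented with a small probe. The following TypeScript sketch requests a list of URLs and reports any that return a 5xx status. It is a hedged example, not an official Braintrust tool: `findServerErrors`, `isServerError`, and the `StatusFetcher` type are hypothetical names, and the URLs you pass in are whatever pages you want to test in your own deployment.

```typescript
// Illustrative probe: requests each URL and reports the ones returning a 5xx status.
// Function and type names are hypothetical; supply URLs from your own deployment.

// Minimal shape of a fetch-like function, so the probe is easy to test.
type StatusFetcher = (url: string) => Promise<{ status: number }>;

interface ProbeResult {
  url: string;
  status: number;
}

// A 5xx status indicates a server-side failure such as the 500 errors described above.
function isServerError(status: number): boolean {
  return status >= 500 && status <= 599;
}

// Requests every URL sequentially and returns only the failing results.
async function findServerErrors(
  urls: string[],
  fetcher: StatusFetcher
): Promise<ProbeResult[]> {
  const failures: ProbeResult[] = [];
  for (const url of urls) {
    const { status } = await fetcher(url);
    if (isServerError(status)) {
      failures.push({ url, status });
    }
  }
  return failures;
}
```

In Node.js 18 or later, the global `fetch` can be passed as the `fetcher` argument. An empty result means none of the probed pages returned a server error at the time of the check. Network failures reject the promise rather than appearing in the result.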
