Summary

Kubernetes readiness probes that hang on the /status endpoint are typically caused by database connection pool exhaustion under load. When all available database connections are occupied, the health check waits indefinitely for a connection to verify database health, which causes probe timeouts and pod failures. The resolution is to increase the PG_POOL_CONFIG_MAX_NUM_CLIENTS environment variable to accommodate the expected load.

Resolution Steps

Step 1: Gather diagnostic information

Ask the customer for the following details:
  • Kubernetes readiness probe timeout settings
  • Which specific health checks are timing out (check pod logs)
  • Resource utilization (CPU/memory) during failures
  • Current PG_POOL_CONFIG_MAX_NUM_CLIENTS setting (if unset, default is 10)
  • Number of API server replicas and expected QPS

Step 2: Identify connection pool exhaustion

Look for patterns indicating database connection pool issues:
  • Normal CPU/memory usage but hanging /status endpoint
  • Manual curl to /status also hangs during load
  • No explicit PG_POOL_CONFIG_MAX_NUM_CLIENTS configuration
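
The manual check above can also be scripted so that a hang is distinguished from an ordinary error. The following is a minimal sketch in Python using only the standard library. The function name probe_status and the outcome labels are illustrative and not part of the product, and the URL passed in is whatever address the customer's probe targets.

```python
import socket
import time
import urllib.error
import urllib.request


def probe_status(url, timeout_s=5.0):
    """Request `url` once and return (outcome, elapsed_seconds).

    outcome is "ok", "timeout", "unreachable", or "http_<code>".
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            resp.read()
            outcome = "ok"
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx/5xx status.
        outcome = f"http_{exc.code}"
    except socket.timeout:
        # Connection accepted, but no response arrived in time (a hang).
        outcome = "timeout"
    except OSError as exc:
        # URLError wraps connect-phase failures; its reason may be a timeout.
        reason = getattr(exc, "reason", None)
        outcome = "timeout" if isinstance(reason, socket.timeout) else "unreachable"
    return outcome, time.monotonic() - start
```

A "timeout" outcome while CPU and memory look normal is consistent with the pool exhaustion pattern described in this step, whereas "unreachable" or an HTTP error code points elsewhere. The timeout_s value would normally be set to match the readiness probe timeout gathered in Step 1.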

Step 3: Configure connection pool size

Instruct the customer to set the database connection pool size based on their deployment method:

For Helm deployments

env:
  - name: PG_POOL_CONFIG_MAX_NUM_CLIENTS
    value: "50"  # example value only; choose one that suits your workload

For direct environment variable configuration

PG_POOL_CONFIG_MAX_NUM_CLIENTS=50

Sizing guideline: Start with 5-10 connections per API server replica, then adjust based on load testing results.

Step 4: Verify the fix

After redeployment, confirm that:
  • The /status endpoint responds consistently under load
  • Readiness probes no longer time out
  • Pods remain healthy during sustained traffic
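
One way to check the first item is to send a batch of concurrent requests to /status while the usual application load is running. The sketch below is a hedged example in Python using only the standard library. The function names, request counts, and URL are assumptions chosen for illustration, and the script itself does not generate the API traffic that exhausts the pool.

```python
import concurrent.futures
import http.client
import time
import urllib.request


def fetch_latency(url, timeout_s):
    """Return the response time in seconds, or None on failure or timeout."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            resp.read()
    except (OSError, http.client.HTTPException):
        # Covers timeouts, refused connections, and HTTP error statuses.
        return None
    return time.monotonic() - start


def check_under_load(url, total=50, concurrency=10, timeout_s=5.0):
    """Send `total` requests, `concurrency` at a time, and summarise them."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda _: fetch_latency(url, timeout_s), range(total)))
    succeeded = [r for r in results if r is not None]
    return {
        "sent": total,
        "succeeded": len(succeeded),
        "failed": total - len(succeeded),
        "slowest_s": max(succeeded, default=None),
    }
```

A result with zero failures and a slowest response comfortably below the probe timeout suggests that the endpoint is no longer waiting on the pool. Any failures are worth correlating with the pod logs from the same time window.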

If connection pool adjustment doesn’t resolve the issue

Consider these additional factors:
  • Database max_connections limit may need adjustment
  • Other dependencies (Redis, S3) may be slow; check for rate limiting
  • Temporarily increase the probe's timeoutSeconds and failureThreshold while testing
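
For the max_connections factor, a quick arithmetic check shows whether the combined pools can fit within the database limit. The sketch below is written in Python and assumes that each API server replica opens its own pool of up to PG_POOL_CONFIG_MAX_NUM_CLIENTS connections. That assumption should be confirmed for the customer's deployment. The function name and the replica count are illustrative.

```python
def connection_headroom(replicas, pool_size, max_connections, reserved=0):
    """Return the connections left on the database after all pools are full.

    A negative result means the pools together can request more connections
    than the database allows. `reserved` covers connections set aside for
    other clients, such as superuser sessions or monitoring.
    """
    return max_connections - reserved - replicas * pool_size


# Example: 4 replicas with a pool size of 50 can open 200 connections.
# PostgreSQL's default max_connections is 100, so the headroom is -100.
headroom = connection_headroom(replicas=4, pool_size=50, max_connections=100)
print(headroom)
```

If the headroom is negative, raising the pool size alone moves the bottleneck to the database, so either max_connections needs to increase or the per-replica pool size needs to come down.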