Kubernetes Readiness Probe Hangs on /status Endpoint

Applies to:

Plan -
Deployment -

Summary

Kubernetes readiness probes hanging on the /status endpoint typically occur due to database connection pool exhaustion under load. When all available database connections are occupied, the health check waits indefinitely for a connection to verify database health, causing probe timeouts and pod failures. The resolution involves increasing the PG_POOL_CONFIG_MAX_NUM_CLIENTS environment variable to accommodate the expected load.

Resolution Steps

Step 1: Gather diagnostic information

Ask the customer for the following details:

Kubernetes readiness probe timeout settings
Which specific health checks are timing out (check pod logs)
Resource utilization (CPU/memory) during failures
Current PG_POOL_CONFIG_MAX_NUM_CLIENTS setting (if unset, default is 10)
Number of API server replicas and expected QPS

Step 2: Identify connection pool exhaustion

Look for patterns indicating database connection pool issues:

Normal CPU/memory usage but hanging /status endpoint
Manual curl to /status also hangs during load
No explicit PG_POOL_CONFIG_MAX_NUM_CLIENTS configuration

Step 3: Configure connection pool size

Instruct the customer to set the database connection pool size based on their deployment method:

For Helm deployments

env:
  - name: PG_POOL_CONFIG_MAX_NUM_CLIENTS
    value: "50" #arbitrary number - select a value that makes sense to your use-case

For direct environment variable configuration

PG_POOL_CONFIG_MAX_NUM_CLIENTS=50

Sizing guideline: Start with 5-10 connections per API server replica, then adjust based on load testing results.

Step 4: Verify the fix

After redeployment, confirm that:

The /status endpoint responds consistently under load
Readiness probes no longer timeout
Pods remain healthy during sustained traffic

If connection pool adjustment doesn’t resolve the issue

Consider these additional factors:

Database max_connections limit may need adjustment
Other dependencies (Redis, S3) may be slow - check for rate limiting
Temporarily increase probe timeout and failure threshold for testing

​Summary

​Resolution Steps

​Step 1: Gather diagnostic information

​Step 2: Identify connection pool exhaustion

​Step 3: Configure connection pool size

​For Helm deployments

​For direct environment variable configuration

​Step 4: Verify the fix

​If connection pool adjustment doesn’t resolve the issue