Why I Built Custom Health Gates Instead of Using depends_on
I run several containerized services on my Proxmox homelab that depend on PostgreSQL and MariaDB. For months, I relied on Docker Compose’s depends_on with the service_healthy condition. It worked most of the time, but I kept hitting edge cases where containers would start before the database was actually ready to accept connections.
The problem wasn’t that the database container was unhealthy—it was that “healthy” didn’t mean “ready for my application.” PostgreSQL might pass its health check but still be running startup scripts or recovering from an unclean shutdown. My n8n instance would crash-loop for 30 seconds until the database caught up. My custom monitoring tools would throw connection errors into logs that I’d have to filter out later.
I needed something more precise: wait until the actual TCP socket accepts connections, then wait a bit longer to be sure. I didn’t want to modify every application’s startup script or add retry logic to code that shouldn’t need it.
What I Actually Built
I wrote a small bash script that polls a TCP socket until it responds, then adds a configurable delay before exiting successfully. The container runs this script as its entrypoint, blocks until the database is confirmed reachable, then hands off to the real application command.
Here’s the core script I use across multiple services:
```bash
#!/bin/bash
set -e

# Connection target and timing, all overridable via environment variables.
HOST="${DB_WAIT_HOST:-localhost}"
PORT="${DB_WAIT_PORT:-5432}"
TIMEOUT="${DB_WAIT_TIMEOUT:-60}"
DELAY="${DB_WAIT_DELAY:-2}"

echo "Waiting for $HOST:$PORT to be reachable..."
elapsed=0
# Poll once per second until the TCP socket accepts a connection.
while ! nc -z "$HOST" "$PORT" 2>/dev/null; do
    if [ "$elapsed" -ge "$TIMEOUT" ]; then
        echo "Timeout waiting for $HOST:$PORT" >&2
        exit 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
done

echo "$HOST:$PORT is reachable. Waiting additional ${DELAY}s for stability..."
sleep "$DELAY"

# exec replaces this shell with the application so it becomes PID 1
# and receives container signals directly.
echo "Starting application: $*"
exec "$@"
```
I mount this script into containers that need it, set environment variables for the database host and port, and replace the container’s command with the script followed by the original startup command.
Example Docker Compose Setup
For my n8n instance connecting to PostgreSQL:
```yaml
services:
  postgres:
    image: postgres:15-alpine
    environment:
      POSTGRES_DB: n8n
      POSTGRES_USER: n8n
      POSTGRES_PASSWORD: ${DB_PASSWORD}
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U n8n"]
      interval: 10s
      timeout: 5s
      retries: 5

  n8n:
    image: n8nio/n8n:latest
    environment:
      DB_TYPE: postgresdb
      DB_POSTGRESDB_HOST: postgres
      DB_POSTGRESDB_PORT: 5432
      DB_POSTGRESDB_DATABASE: n8n
      DB_POSTGRESDB_USER: n8n
      DB_POSTGRESDB_PASSWORD: ${DB_PASSWORD}
      DB_WAIT_HOST: postgres
      DB_WAIT_PORT: 5432
      DB_WAIT_TIMEOUT: 90
      DB_WAIT_DELAY: 3
    volumes:
      - ./wait-for-db.sh:/usr/local/bin/wait-for-db.sh:ro
      - n8n_data:/home/node/.n8n
    command: ["/usr/local/bin/wait-for-db.sh", "n8n", "start"]
    depends_on:
      postgres:
        condition: service_healthy

volumes:
  postgres_data:
  n8n_data:
```
I kept depends_on with service_healthy because it still provides value—it prevents the n8n container from even attempting to start if PostgreSQL hasn’t passed its basic health check. The script adds a second layer: confirming the socket is actually accepting connections before the application tries to use it.
What Actually Worked
The TCP socket polling eliminated startup race conditions completely. I haven’t seen a single connection error in logs since implementing this across my database-dependent containers. The additional stability delay (usually 2-3 seconds) catches cases where the socket opens but the database is still initializing internal state.
Using nc -z (netcat in zero-I/O mode) is lightweight and doesn’t require installing additional packages in most Alpine-based images. It’s already present in the base images I use for PostgreSQL and MariaDB clients.
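For images that ship without netcat entirely, one fallback I'm aware of (a sketch, not something from my stack) is bash's `/dev/tcp` pseudo-device. It's a bash feature, not POSIX sh, so it won't help under busybox `sh`:

```bash
#!/bin/bash
# Bash-only port probe: redirecting to /dev/tcp/HOST/PORT attempts a TCP
# connect. The subshell keeps fd 3 from leaking into the caller.
check_port() {
    (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

if check_port localhost 5432; then
    echo "reachable"
else
    echo "unreachable"
fi
```

Swapping `check_port` in for `nc -z` in the polling loop is a drop-in change, since both report reachability purely via exit status.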
The script’s timeout and delay are configurable via environment variables, so I can tune them per service without modifying the script itself. For PostgreSQL on my Synology NAS (which is slower to start), I use a 90-second timeout. For MariaDB on Proxmox with NVMe storage, 30 seconds is plenty.
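Per-service tuning is then just a matter of environment values in each service definition. A hypothetical fragment for a monitoring service pointed at the slower NAS database (the service and host names here are illustrative, not from my actual stack):

```yaml
services:
  monitoring:
    environment:
      DB_WAIT_HOST: nas-postgres   # hypothetical hostname for the NAS database
      DB_WAIT_TIMEOUT: 90          # slower spinning-disk startup needs more headroom
      DB_WAIT_DELAY: 3
```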
What the Delay Actually Does
The delay after socket confirmation isn’t arbitrary. I tested this by removing it and watching PostgreSQL logs during container restarts. Even after the socket opened, I saw messages like “database system is ready to accept connections” followed by “autovacuum launcher started” a second later. Applications connecting in that window sometimes got “the database system is starting up” errors.
A 2-3 second delay gives PostgreSQL time to finish those final initialization steps. It’s not scientifically derived—I just tested different values until errors stopped appearing.
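An alternative I haven't battle-tested the same way would be to require several consecutive successful probes instead of a fixed sleep. A sketch, with `wait_for_stable` as a name I'm inventing here; the probe command is passed in so it works with `nc -z` or anything else that signals via exit status:

```bash
#!/bin/bash
# Require 3 consecutive successful probes, one second apart, before
# returning. Any failed probe resets the streak to zero.
wait_for_stable() {
    local streak=0
    while [ "$streak" -lt 3 ]; do
        if "$@" 2>/dev/null; then
            streak=$((streak + 1))
        else
            streak=0  # a single failure resets the count
        fi
        sleep 1
    done
}

# Example usage: wait_for_stable nc -z "$HOST" "$PORT"
```

The trade-off is a guaranteed minimum of three seconds of waiting, which is roughly what my fixed delay costs anyway.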
What Didn’t Work
My first attempt used curl to check an HTTP health endpoint on the database container. This required adding a web server to the database container just for health checks, which felt wrong. It also didn’t solve the problem—HTTP 200 responses came back before the database was ready for connections.
I tried using pg_isready directly in the dependent container’s entrypoint, but that required installing PostgreSQL client tools in every image that needed to wait. For lightweight Alpine containers, this added 30-40MB per image. The netcat approach works with tools already present.
I also experimented with exponential backoff in the polling loop, thinking it would reduce CPU usage during long waits. In practice, checking once per second for 60 seconds uses negligible resources, and the simpler linear approach is easier to reason about when debugging.
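For reference, the backoff variant looked roughly like this (reconstructed from memory, and shown here with bash's built-in `/dev/tcp` probe so the sketch has no external dependency):

```bash
#!/bin/bash
# Exponential backoff between probes: 1s, 2s, 4s, then capped at 8s.
# Returns 0 once HOST:PORT accepts a TCP connection, 1 on timeout.
backoff_wait() {
    local host="$1" port="$2" timeout="$3"
    local interval=1 elapsed=0
    until (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
        interval=$((interval * 2))
        if [ "$interval" -gt 8 ]; then
            interval=8  # cap so the gap between probes stays bounded
        fi
    done
}
```

The practical problem: with an 8-second gap, the script can sit idle for up to 7 seconds after the database actually comes up, which defeats the point of polling in the first place.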
The Exec Trap
I initially forgot to use exec "$@" to replace the shell process with the application. That left the shell as PID 1, so signals sent to the container (like docker stop) never reached the actual application. Containers sat through the full 10-second grace period and were then SIGKILLed instead of shutting down gracefully.
Using exec ensures the application becomes PID 1 and receives signals properly.
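The difference is easy to see by printing PIDs: with exec, the "application" keeps the wrapper shell's PID instead of running as a child process.

```shell
# $$ expands to the shell's own PID. Because exec replaces the process
# image rather than forking, both lines print the same PID.
sh -c 'echo "wrapper pid: $$"; exec sh -c "echo \"app pid: \$\$\""'
```

Without the exec, the inner shell would fork as a child and print a different PID, which is exactly the situation where signals stop at the wrapper.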
When This Approach Doesn’t Make Sense
If you’re running in Kubernetes or another orchestrator with proper init containers and readiness probes, use those instead. This solution is specifically for Docker Compose environments where you control the full stack and want simple, reliable startup ordering.
For production systems with complex dependency graphs, you probably want something more sophisticated than bash scripts. But for my homelab services—n8n, Cronicle, monitoring tools—this has been rock-solid for over six months.
Key Takeaways
TCP socket polling is more reliable than health checks alone for database-dependent containers. Health checks tell you the container is alive, but not whether the service inside is ready for your specific use case.
A small stability delay after socket confirmation catches edge cases where the service is technically reachable but still initializing. This is especially true for databases that perform recovery or migration steps on startup.
Using exec to replace the shell process with your application is critical for proper signal handling. Without it, containers won’t shut down cleanly.
Configurable timeouts and delays via environment variables make the same script reusable across services with different startup characteristics. I don’t need separate scripts for fast and slow databases.
This approach works well in homelab environments where you control the entire stack and want simplicity over enterprise-grade orchestration features. It’s solved a real problem for me without adding complexity or dependencies.