Why I Built Custom Health Gates Instead of Using depends_on
I run several containerized services on my Proxmox homelab that depend on PostgreSQL and MariaDB. For months, I relied on Docker Compose’s depends_on with the service_healthy condition. It worked most of the time, but I kept hitting edge cases where containers would start before the database was actually ready to accept connections.
The problem wasn’t that the database container was unhealthy—it was that “healthy” didn’t mean “ready for my application.” PostgreSQL might pass its health check but still be running startup scripts or recovering from an unclean shutdown. My n8n instance would crash-loop for 30 seconds until the database caught up. My custom monitoring tools would throw connection errors into logs that I’d have to filter out later.
I needed something more precise: wait until the actual TCP socket accepts connections, then wait a bit longer to be sure. I didn’t want to modify every application’s startup script or add retry logic to code that shouldn’t need it.
What I Actually Built
I wrote a small bash script that polls a TCP socket until it responds, then adds a configurable delay before exiting successfully. The container runs this script as its entrypoint, blocks until the database is confirmed reachable, then hands off to the real application command.
Here’s the core script I use across multiple services:
```bash
#!/bin/bash
set -e

# Connection target and timing, all overridable via environment variables.
HOST="${DB_WAIT_HOST:-localhost}"
PORT="${DB_WAIT_PORT:-5432}"
TIMEOUT="${DB_WAIT_TIMEOUT:-60}"
DELAY="${DB_WAIT_DELAY:-2}"

echo "Waiting for $HOST:$PORT to be reachable..."
elapsed=0
# Poll once per second until the TCP socket accepts a connection.
while ! nc -z "$HOST" "$PORT" 2>/dev/null; do
    if [ "$elapsed" -ge "$TIMEOUT" ]; then
        echo "Timeout waiting for $HOST:$PORT" >&2
        exit 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
done

echo "$HOST:$PORT is reachable. Waiting additional ${DELAY}s for stability..."
sleep "$DELAY"

# exec replaces this shell with the application so it becomes PID 1
# and receives container signals directly.
echo "Starting application: $*"
exec "$@"
```
I mount this script into containers that need it, set environment variables for the database host and port, and replace the container’s command with the script followed by the original startup command.
Example Docker Compose Setup
For my n8n instance connecting to PostgreSQL:
```yaml
services:
  postgres:
    image: postgres:15-alpine
    environment:
      POSTGRES_DB: n8n
      POSTGRES_USER: n8n
      POSTGRES_PASSWORD: ${DB_PASSWORD}
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U n8n"]
      interval: 10s
      timeout: 5s
      retries: 5

  n8n:
    image: n8nio/n8n:latest
    environment:
      DB_TYPE: postgresdb
      DB_POSTGRESDB_HOST: postgres
      DB_POSTGRESDB_PORT: 5432
      DB_POSTGRESDB_DATABASE: n8n
      DB_POSTGRESDB_USER: n8n
      DB_POSTGRESDB_PASSWORD: ${DB_PASSWORD}
      DB_WAIT_HOST: postgres
      DB_WAIT_PORT: 5432
      DB_WAIT_TIMEOUT: 90
      DB_WAIT_DELAY: 3
    volumes:
      - ./wait-for-db.sh:/usr/local/bin/wait-for-db.sh:ro
      - n8n_data:/home/node/.n8n
    command: ["/usr/local/bin/wait-for-db.sh", "n8n", "start"]
    depends_on:
      postgres:
        condition: service_healthy

volumes:
  postgres_data:
  n8n_data:
```
I kept depends_on with service_healthy because it still provides value—it prevents the n8n container from even attempting to start if PostgreSQL hasn’t passed its basic health check. The script adds a second layer: confirming the socket is actually accepting connections before the application tries to use it.
What Actually Worked
The TCP socket polling eliminated startup race conditions completely. I haven’t seen a single connection error in logs since implementing this across my database-dependent containers. The additional stability delay (usually 2-3 seconds) catches cases where the socket opens but the database is still initializing internal state.
Using nc -z (netcat in zero-I/O mode) is lightweight and doesn’t require installing additional packages in most Alpine-based images. It’s already present in the base images I use for PostgreSQL and MariaDB clients.
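For images that ship without netcat entirely, one fallback I'm aware of (a sketch, not something from my stack) is bash's `/dev/tcp` pseudo-device. It's a bash feature, not POSIX sh, so it won't help under busybox `sh`:

```bash
#!/bin/bash
# Bash-only port probe: redirecting to /dev/tcp/HOST/PORT attempts a TCP
# connect. The subshell keeps fd 3 from leaking into the caller.
check_port() {
    (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

if check_port localhost 5432; then
    echo "reachable"
else
    echo "unreachable"
fi
```

Swapping `check_port` in for `nc -z` in the polling loop is a drop-in change, since both report reachability purely via exit status.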
The script’s timeout and delay are configurable via environment variables, so I can tune them per service without modifying the script itself. For PostgreSQL on my Synology NAS (which is slower to start), I use a 90-second timeout. For MariaDB on Proxmox with NVMe storage, 30 seconds is plenty.
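Per-service tuning is then just a matter of environment values in each service definition. A hypothetical fragment for a monitoring service pointed at the slower NAS database (the service and host names here are illustrative, not from my actual stack):

```yaml
services:
  monitoring:
    environment:
      DB_WAIT_HOST: nas-postgres   # hypothetical hostname for the NAS database
      DB_WAIT_TIMEOUT: 90          # slower spinning-disk startup needs more headroom
      DB_WAIT_DELAY: 3
```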
What the Delay Actually Does
The delay after socket confirmation isn’t arbitrary. I tested this by removing it and watching PostgreSQL logs during container restarts. Even after the socket opened, I saw messages like “database system is ready to accept connections” followed by “autovacuum launcher started” a second later. Applications connecting in that window sometimes got “the database system is starting up” errors.
A 2-3 second delay gives PostgreSQL time to finish those final initialization steps. It’s not scientifically derived—I just tested different values until errors stopped appearing.
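An alternative I haven't battle-tested the same way would be to require several consecutive successful probes instead of a fixed sleep. A sketch, with `wait_for_stable` as a name I'm inventing here; the probe command is passed in so it works with `nc -z` or anything else that signals via exit status:

```bash
#!/bin/bash
# Require 3 consecutive successful probes, one second apart, before
# returning. Any failed probe resets the streak to zero.
wait_for_stable() {
    local streak=0
    while [ "$streak" -lt 3 ]; do
        if "$@" 2>/dev/null; then
            streak=$((streak + 1))
        else
            streak=0  # a single failure resets the count
        fi
        sleep 1
    done
}

# Example usage: wait_for_stable nc -z "$HOST" "$PORT"
```

The trade-off is a guaranteed minimum of three seconds of waiting, which is roughly what my fixed delay costs anyway.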
What Didn’t Work
My first attempt used curl to check an HTTP health endpoint on the database container. This required adding a web server to the database container just for health checks, which felt wrong. It also didn’t solve the problem—HTTP 200 responses came back before the database was ready for connections.
I tried using pg_isready directly in the dependent container’s entrypoint, but that required installing PostgreSQL client tools in every image that needed to wait. For lightweight Alpine containers, this added 30-40MB per image. The netcat approach works with tools already present.
I also experimented with exponential backoff in the polling loop, thinking it would reduce CPU usage during long waits. In practice, checking once per second for 60 seconds uses negligible resources, and the simpler linear approach is easier to reason about when debugging.
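For reference, the backoff variant looked roughly like this (reconstructed from memory, and shown here with bash's built-in `/dev/tcp` probe so the sketch has no external dependency):

```bash
#!/bin/bash
# Exponential backoff between probes: 1s, 2s, 4s, then capped at 8s.
# Returns 0 once HOST:PORT accepts a TCP connection, 1 on timeout.
backoff_wait() {
    local host="$1" port="$2" timeout="$3"
    local interval=1 elapsed=0
    until (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
        interval=$((interval * 2))
        if [ "$interval" -gt 8 ]; then
            interval=8  # cap so the gap between probes stays bounded
        fi
    done
}
```

The practical problem: with an 8-second gap, the script can sit idle for up to 7 seconds after the database actually comes up, which defeats the point of polling in the first place.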
The Exec Trap
I initially forgot to use exec "$@" to replace the shell process with the application. That left the shell as PID 1, so signals sent to the container (like docker stop) never reached the actual application. Containers sat through the full 10-second grace period and were then SIGKILLed instead of shutting down gracefully.
Using exec ensures the application becomes PID 1 and receives signals properly.
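The difference is easy to see by printing PIDs: with exec, the "application" keeps the wrapper shell's PID instead of running as a child process.

```shell
# $$ expands to the shell's own PID. Because exec replaces the process
# image rather than forking, both lines print the same PID.
sh -c 'echo "wrapper pid: $$"; exec sh -c "echo \"app pid: \$\$\""'
```

Without the exec, the inner shell would fork as a child and print a different PID, which is exactly the situation where signals stop at the wrapper.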
When This Approach Doesn’t Make Sense
If you’re running in Kubernetes or another orchestrator with proper init containers and readiness probes, use those instead. This solution is specifically for Docker Compose environments where you control the full stack and want simple, reliable startup ordering.
For production systems with complex dependency graphs, you probably want something more sophisticated than bash scripts. But for my homelab services—n8n, Cronicle, monitoring tools—this has been rock-solid for over six months.
Key Takeaways
TCP socket polling is more reliable than health checks alone for database-dependent containers. Health checks tell you the container is alive, but not whether the service inside is ready for your specific use case.
A small stability delay after socket confirmation catches edge cases where the service is technically reachable but still initializing. This is especially true for databases that perform recovery or migration steps on startup.
Using exec to replace the shell process with your application is critical for proper signal handling. Without it, containers won’t shut down cleanly.
Configurable timeouts and delays via environment variables make the same script reusable across services with different startup characteristics. I don’t need separate scripts for fast and slow databases.
This approach works well in homelab environments where you control the entire stack and want simplicity over enterprise-grade orchestration features. It’s solved a real problem for me without adding complexity or dependencies.