Why I Started Looking Into Docker Health Checks
I run most of my services through Docker Compose on Proxmox VMs. For a long time, I relied on Docker's default behavior—if the container process was running, I assumed everything was fine. That worked until it didn't.
One morning, I noticed my n8n automation workflows had stopped executing. The container was "up," but the web interface wouldn't load. I SSH'd in, checked logs, and found the Node.js process had hung during startup. Docker had no idea because it only cared that the process existed, not whether it was actually functional.
That's when I realized I needed proper health checks—not just for monitoring, but to prevent dependent services from starting against broken backends.
What I Actually Use Health Checks For
In my setup, health checks serve two purposes:
- Preventing cascading failures when a service isn't ready
- Enabling automatic restarts when a container becomes unresponsive
I don't use them for every container. Static services like Nginx or databases with built-in retry logic don't need them. But for application containers that depend on each other—like my n8n instance connecting to PostgreSQL, or my custom scraping tools hitting APIs—health checks prevent startup race conditions.
My Basic Health Check Setup
Here's a real example from my n8n configuration:
services:
  n8n:
    image: n8nio/n8n:latest
    environment:
      - DB_TYPE=postgresdb
      - DB_POSTGRESDB_HOST=postgres
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:5678/healthz"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
    depends_on:
      postgres:
        condition: service_healthy

  postgres:
    image: postgres:15
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "n8n"]
      interval: 10s
      timeout: 3s
      retries: 5
The key here is depends_on with condition: service_healthy. Without that, n8n would try to connect to PostgreSQL before the database was ready, fail, and enter a restart loop.
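For contrast, the short list form of depends_on only controls start order: Compose launches n8n as soon as the postgres container has started, without waiting for its health check to pass.

depends_on:
  - postgres   # ordered start only; does not wait for "healthy"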
Why I Use wget Instead of curl
Most examples online use curl, but I learned the hard way that many minimal Docker images don't include it. Alpine-based containers especially. I spent an hour debugging why my health checks were failing before realizing the container didn't have curl installed.
I switched to wget --spider -q because:
- It's more commonly pre-installed in Alpine images
- The --spider flag prevents downloading the response body
- The -q flag suppresses output, keeping logs clean
If neither wget nor curl is available, I add this to the Dockerfile:
RUN apk add --no-cache wget
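When I'm not sure what a minimal image ships with, a CMD-SHELL variant that tries wget and falls back to curl also works (CMD-SHELL needs a shell in the image, which Alpine has). This is just a sketch reusing the n8n endpoint from above:

healthcheck:
  # If wget is missing or fails, curl gets a try; the check fails only if both do
  test: ["CMD-SHELL", "wget --spider -q http://localhost:5678/healthz || curl -fsS -o /dev/null http://localhost:5678/healthz"]
  interval: 30s
  timeout: 10s
  retries: 3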
What Didn't Work: Overly Aggressive Timeouts
My first attempt at health checks used a 10-second interval with a 5-second timeout. That caused problems during high load—the container would occasionally take 6-7 seconds to respond, triggering false failures.
I adjusted to:
- interval: 30s for most services (no need to hammer them)
- timeout: 10s to account for slow responses under load
- start_period: 40s for services with long initialization (like n8n with database migrations)
The start_period is critical. It gives the container time to fully initialize before health checks start counting toward the retries limit.
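It's also worth confirming which values actually applied, since an image can ship its own HEALTHCHECK and the Compose settings override it. The durations come back in nanoseconds:

docker inspect --format='{{json .Config.Healthcheck}}' container_name | jq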
Debugging Unhealthy Containers
When a container shows as "unhealthy," Docker's output is frustratingly vague. Here's how I actually debug it:
1. Check the Health Status Details
docker inspect --format='{{json .State.Health}}' container_name | jq
This shows the last few health check results and their exit codes. If the command is failing, you'll see the error here.
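The JSON has a Status field, a FailingStreak counter, and a Log array holding the last few runs. Roughly, with illustrative values:

{
  "Status": "unhealthy",
  "FailingStreak": 3,
  "Log": [
    { "Start": "...", "End": "...", "ExitCode": 1, "Output": "..." }
  ]
}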
2. Run the Health Check Command Manually
docker exec -it container_name sh
wget --spider -q http://localhost:5678/healthz
echo $?
If the exit code is non-zero, the health check is legitimately failing. If it returns 0, the issue is with the health check configuration, not the service.
3. Common Failure Reasons I've Hit
- Missing command: The health check tool (wget, curl, pg_isready) isn't installed
- Wrong port: The service binds to a different port than the health check expects
- Startup delay: The service isn't ready within start_period
- Network isolation: The health check hits localhost, but the service listens only on a specific interface IP, or localhost resolves to ::1 while the service binds only IPv4 (0.0.0.0) — see the commands below
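For the last two, I check what the service is actually listening on from inside the container. netstat is a BusyBox applet, so it's usually available even in Alpine images, and hitting 127.0.0.1 explicitly rules out localhost resolving to ::1 (the port here is from my n8n example):

docker exec container_name netstat -tln
docker exec container_name wget --spider -q http://127.0.0.1:5678/healthz; echo $?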
Health Checks for PostgreSQL
For PostgreSQL, I use pg_isready instead of a TCP connection check:
healthcheck:
  test: ["CMD", "pg_isready", "-U", "postgres"]
  interval: 10s
  timeout: 3s
  retries: 5
This actually verifies the database is accepting connections, not just that the port is open. I tried using nc -z localhost 5432 at first, but that passed even when PostgreSQL was in recovery mode after a crash.
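If pg_isready ever isn't strict enough, running a trivial query is the next step up. This is a sketch, not something from my stack, and it assumes the official image's default pg_hba.conf, which trusts local socket connections, so psql works without a password inside the container:

healthcheck:
  # Execute a real query, not just the connection handshake
  test: ["CMD-SHELL", "psql -U postgres -d postgres -c 'SELECT 1' || exit 1"]
  interval: 10s
  timeout: 3s
  retries: 5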
When I Don't Use Health Checks
I don't add health checks to:
- Nginx/Traefik: These fail fast and loudly if misconfigured
- Redis: My use case doesn't require strict readiness checks
- One-off containers: If it's not part of a dependency chain, I skip it
Health checks add complexity. If a service doesn't have downstream dependencies, the default Docker behavior is fine.
Handling Rollbacks When Health Checks Fail
Health checks alone don't roll back deployments; they just mark containers as unhealthy. To actually roll back, I wrote a simple script that runs after docker-compose up -d:
#!/bin/bash
SERVICE="n8n"
MAX_WAIT=60
for i in $(seq 1 $MAX_WAIT); do
  STATUS=$(docker inspect --format='{{.State.Health.Status}}' $SERVICE 2>/dev/null)
  if [ "$STATUS" = "healthy" ]; then
    echo "Service is healthy"
    exit 0
  fi
  sleep 1
done
echo "Service failed health check. Rolling back..."
docker-compose down
docker-compose up -d --no-deps $SERVICE
This is basic, but it works for my setup. In production, I'd use Watchtower or a CI/CD pipeline with proper versioning.
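I run it as part of the same deploy step, so a bad update gets caught right away. The name check-health.sh is just a placeholder for wherever the script above lives:

docker-compose up -d && ./check-health.sh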
Key Takeaways
- Health checks prevent dependency race conditions; they're not just for monitoring
- Use wget or verify that the health check tool exists in the container
- Set start_period high enough for slow-starting services
- Debug with docker inspect and manual command execution
- Don't add health checks unless they serve a purpose
For my self-hosted stack, health checks have eliminated most startup failures. The initial setup took some trial and error, but now my services start reliably, even after a full Proxmox reboot.