Why I Built Custom Health Checks Before Prometheus Could Alert

I run about 20 Docker containers on my Proxmox setup—databases, web services, automation tools, and monitoring stacks. For a long time, I relied on Prometheus and Alertmanager to tell me when something went wrong. The problem was that by the time an alert fired, the service had already been down for minutes. Users noticed before I did.

I needed containers to detect their own problems and restart themselves before Prometheus even knew there was an issue. That meant writing custom health check scripts that went beyond simple TCP port checks.

What Docker Health Checks Actually Do

Docker's built-in health check mechanism runs a command inside the container at regular intervals. If the command exits with code 0, the container is marked healthy. Any other exit code counts as a failure, and after a defined number of consecutive failures Docker marks the container unhealthy. One caveat worth stating up front: Docker itself won't restart an unhealthy container, because restart policies only fire when a container exits. Something has to act on the unhealthy status (more on this below).

The key difference from external monitoring: the check runs inside the container, using the same environment and dependencies the service relies on. If the database client can't connect, the health check fails—even if the port is technically open.
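
The same mechanism can also be baked into an image at build time with a Dockerfile HEALTHCHECK instruction; a healthcheck block in docker-compose.yml overrides whatever the image defines. A minimal example, using the script path from the compose example below:

HEALTHCHECK --interval=30s --timeout=10s --start-period=20s --retries=3 \
  CMD /app/healthcheck.sh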

My Real Setup

I define health checks directly in my docker-compose.yml files. Here's the structure I use:

services:
  myservice:
    image: myimage:latest
    healthcheck:
      test: ["CMD", "/app/healthcheck.sh"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 20s
    restart: unless-stopped

The restart: unless-stopped policy covers crashes and host reboots, but it doesn't act on health check failures by itself: restart policies only trigger when the container exits, not when it turns unhealthy. To restart containers automatically when they go unhealthy, you need a watcher that reacts to the status. A common choice is the willfarrell/autoheal companion container.
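
Here's a minimal sketch of that setup, assuming the willfarrell/autoheal image: it watches the Docker socket and restarts any container labeled autoheal=true once its health status flips to unhealthy.

services:
  autoheal:
    image: willfarrell/autoheal:latest
    restart: unless-stopped
    environment:
      # Only act on containers carrying the autoheal label (the image default)
      - AUTOHEAL_CONTAINER_LABEL=autoheal
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock

  myservice:
    image: myimage:latest
    labels:
      - autoheal=true
    # healthcheck and restart policy as above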

Parameters I Actually Use

  • interval: How often to run the check. I use 30 seconds for most services, 15 seconds for critical ones.
  • timeout: Maximum time the check can take. I set this to 10 seconds to catch hung processes.
  • retries: Number of consecutive failures before marking unhealthy. I use 3 to avoid false positives from transient issues.
  • start_period: Grace period after container start. Essential for services like PostgreSQL that need time to initialize.

Custom Scripts That Worked

I write health check scripts in bash and mount them into containers. Here are patterns I actually use in production.

PostgreSQL Database Check

I run PostgreSQL for several services. The standard pg_isready check only verifies the server is accepting connections; it doesn't confirm the database is actually usable.

#!/bin/bash
# /healthcheck/postgres.sh
# Runs inside the postgres container, so psql connects over the local socket.

psql -U "$POSTGRES_USER" -d "$POSTGRES_DB" -c "SELECT 1" > /dev/null 2>&1
exit $?

This script attempts a real query. If the connection fails, the query times out, or the database is locked, the check fails.

In my compose file:

services:
  postgres:
    image: postgres:16
    volumes:
      - ./healthcheck:/healthcheck
    healthcheck:
      test: ["CMD", "/healthcheck/postgres.sh"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

The 40-second start period gives PostgreSQL time to run migrations and accept connections.
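
One detail that's easy to miss: bind mounts preserve host permissions, so the script has to be executable on the host or the check fails before it ever runs:

chmod +x ./healthcheck/postgres.sh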

Web Service with Dependency Checks

I have a custom web service that depends on both PostgreSQL and Redis. The service itself might be running, but if either dependency is down, it should be marked unhealthy.

#!/bin/bash
# /healthcheck/webapp.sh

# 1. The HTTP endpoint responds with a success status
curl -sf http://localhost:8080/health > /dev/null || exit 1

# 2. The database accepts a real query (assumes PGPASSWORD is set in the container environment)
psql -h postgres -U "$DB_USER" -d "$DB_NAME" -c "SELECT 1" > /dev/null 2>&1 || exit 1

# 3. Redis answers a PING
redis-cli -h redis ping > /dev/null 2>&1 || exit 1

exit 0

This script checks three things: the HTTP endpoint responds, the database accepts queries, and Redis is reachable. If any check fails, the container is marked unhealthy and restarted.
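
For reference, the compose wiring might look like this; a sketch that assumes the service names (postgres, redis) and variables (DB_USER, DB_NAME) used in the script above:

services:
  webapp:
    image: mywebapp:latest
    volumes:
      - ./healthcheck:/healthcheck
    healthcheck:
      test: ["CMD", "/healthcheck/webapp.sh"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 20s

Note that the webapp image needs psql and redis-cli installed for this script to work, which is the same gotcha as the curl story below.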

n8n Workflow Automation

I use n8n for automation workflows. The default health check just verifies the web interface loads—it doesn't confirm workflows can actually execute.

#!/bin/bash
# /healthcheck/n8n.sh

# Check the web interface
curl -sf http://localhost:5678/healthz > /dev/null || exit 1

# Verify the database connection (n8n uses PostgreSQL)
psql -h postgres -U "$N8N_DB_USER" -d "$N8N_DB_NAME" -c "SELECT COUNT(*) FROM workflow" > /dev/null 2>&1 || exit 1

exit 0

The second check queries the workflow table. If the database is corrupted or locked, this fails before any workflow tries to run.

What Didn't Work

Using curl Without Installing It

My first attempt at a health check for a Node.js service used curl to hit an HTTP endpoint. The container kept showing as unhealthy, even though the service was clearly running.

The problem: the Alpine-based image didn't include curl. Docker was silently failing the health check because the command didn't exist.

I confirmed this by running:

docker exec -it container_name sh
curl http://localhost:3000/health

Result: sh: curl: not found

I added curl to the Dockerfile:

RUN apk add --no-cache curl
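
An alternative that avoids installing anything: busybox-based Alpine images usually ship a minimal wget, whose --spider mode works as a drop-in for simple endpoint checks (worth verifying in your specific image first):

wget -q --spider http://localhost:3000/health || exit 1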

Lesson: Always test health check commands manually inside the container before deploying.

Checking Ports Instead of Functionality

I tried using nc -z localhost 5432 to check if PostgreSQL was running. The port was open, but the database was stuck in recovery mode from a bad shutdown. The health check passed, but queries failed.

Port checks are useful for network-level verification, but they don't confirm the service is actually usable. I switched to query-based checks for all databases.
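
To make the difference concrete, here is how the two checks can disagree while PostgreSQL is replaying its write-ahead log after a crash:

# Port check: passes as soon as the postmaster is listening
nc -z localhost 5432

# Query check: fails until the database can actually serve sessions
psql -U "$POSTGRES_USER" -d "$POSTGRES_DB" -c "SELECT 1"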

Too-Short Start Periods

I initially set start_period: 10s for a PostgreSQL container. During startup, the database runs initialization scripts that take 30 seconds. Docker marked the container unhealthy before it finished starting, triggering unnecessary restarts.

I increased the start period to 40 seconds and the problem disappeared. Now I always check logs to see how long initialization actually takes before setting this value.
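
A quick way to measure that is to read the timestamped container logs; for PostgreSQL, the gap between the first log line and the ready message is the real initialization time:

docker logs --timestamps postgres 2>&1 | grep "ready to accept connections"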

Integrating with Service Dependencies

Docker Compose supports waiting for health checks before starting dependent services. I use this to ensure databases are ready before web services start.

services:
  postgres:
    image: postgres:16
    healthcheck:
      test: ["CMD", "/healthcheck/postgres.sh"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

  webapp:
    image: mywebapp:latest
    depends_on:
      postgres:
        condition: service_healthy

Without condition: service_healthy, Compose just waits for the PostgreSQL container to start—not for the database to be ready. With it, the webapp won't start until PostgreSQL passes its health check.

Monitoring Health Check Status

I check container health using:

docker compose ps

Output shows health status in the STATUS column:

NAME       STATUS
postgres   Up 5 minutes (healthy)
webapp     Up 3 minutes (healthy)

For detailed health check logs:

docker inspect --format='{{json .State.Health}}' container_name | jq

This shows the last few health check results, including exit codes and error messages.
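
For a quick overview across the whole host, docker ps can filter on health status directly:

docker ps --filter "health=unhealthy" --format '{{.Names}}: {{.Status}}'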

How This Reduced Downtime

Before implementing custom health checks, my average time to detect and restart a failed service was 3-5 minutes. Prometheus scrapes every 30 seconds, then waits for multiple failed scrapes before alerting. By the time I got the alert and manually restarted the container, users had already noticed.

With health checks, failing containers are restarted automatically within about 90 seconds of the first failure (3 retries × 30-second interval). Prometheus alerts now fire as informational notices, not urgent incidents.

I've also eliminated several classes of "running but broken" states—services that appear up but can't actually process requests due to database locks, disk full conditions, or corrupted state files.

Key Takeaways

  • Health checks should test actual functionality, not just port availability
  • Always verify health check commands work inside the container before deploying
  • Set start_period based on real initialization times, not guesses
  • Use depends_on with service_healthy to enforce startup order
  • Custom scripts let you check dependencies and application state, not just process status
  • Automatic restarts driven by health status (via a watcher such as autoheal) catch issues faster than external monitoring

Current Limitations

Health checks don't solve every problem. They can't detect:

  • Slow performance degradation (response times climbing while checks still pass)
  • Resource exhaustion that hasn't caused failures yet
  • External dependency issues that don't immediately break functionality

I still use Prometheus for these scenarios. Health checks are a first line of defense—they catch hard failures and restart containers before monitoring systems notice. Prometheus catches everything else.