
Setting Up Docker Compose Health Checks with Automatic Rollback on Failed Deployments

Why I Started Looking Into Docker Health Checks

I run most of my services through Docker Compose on Proxmox VMs. For a long time, I relied on Docker's default behavior—if the container process was running, I assumed everything was fine. That worked until it didn't.

One morning, I noticed my n8n automation workflows had stopped executing. The container was "up," but the web interface wouldn't load. I SSH'd in, checked logs, and found the Node.js process had hung during startup. Docker had no idea because it only cared that the process existed, not whether it was actually functional.

That's when I realized I needed proper health checks—not just for monitoring, but to prevent dependent services from starting against broken backends.

What I Actually Use Health Checks For

In my setup, health checks serve two purposes:

  • Preventing cascading failures when a service isn't ready
  • Enabling automatic restarts when a container becomes unresponsive

I don't use them for every container. Static services like Nginx or databases with built-in retry logic don't need them. But for application containers that depend on each other—like my n8n instance connecting to PostgreSQL, or my custom scraping tools hitting APIs—health checks prevent startup race conditions.

My Basic Health Check Setup

Here's a real example from my n8n configuration:

services:
  n8n:
    image: n8nio/n8n:latest
    environment:
      - DB_TYPE=postgresdb
      - DB_POSTGRESDB_HOST=postgres
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:5678/healthz"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
    depends_on:
      postgres:
        condition: service_healthy

  postgres:
    image: postgres:15
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "n8n"]
      interval: 10s
      timeout: 3s
      retries: 5

The key here is depends_on with condition: service_healthy. Without that, n8n would try to connect to PostgreSQL before the database was ready, fail, and enter a restart loop.
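
To confirm the gate works, I watch the container status while the stack comes up; Docker appends the health state to the STATUS column:

docker-compose up -d
docker ps --format 'table {{.Names}}\t{{.Status}}'

A checked container shows "(health: starting)" until the first probe passes, then "(healthy)" or "(unhealthy)".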

Why I Use wget Instead of curl

Most examples online use curl, but I learned the hard way that many minimal Docker images, especially Alpine-based ones, don't include it. I spent an hour debugging why my health checks were failing before realizing the container didn't have curl installed.

I switched to wget --spider -q because:

  • It's more commonly pre-installed in Alpine images
  • The --spider flag prevents downloading the response body
  • The -q flag suppresses output, keeping logs clean
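
If an image ships curl but not wget, the equivalent check would be (my variant, not something from the config above):

test: ["CMD", "curl", "-fsS", "http://localhost:5678/healthz"]

The -f flag is what matters here: it makes curl exit non-zero on HTTP error responses, so a 500 from the healthz endpoint actually fails the check.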

If neither wget nor curl is available, I add this to the Dockerfile:

RUN apk add --no-cache wget

What Didn't Work: Overly Aggressive Timeouts

My first attempt at health checks used a 10-second interval with a 5-second timeout. That caused problems during high load—the container would occasionally take 6-7 seconds to respond, triggering false failures.

I adjusted to:

  • interval: 30s for most services (no need to hammer them)
  • timeout: 10s to account for slow responses under load
  • start_period: 40s for services with long initialization (like n8n with database migrations)

The start_period is critical. It gives the container time to fully initialize before health checks start counting toward the retries limit.
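
Putting those values together, the tuned block for a slow starter like n8n looks like this:

healthcheck:
  test: ["CMD", "wget", "--spider", "-q", "http://localhost:5678/healthz"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 40s

The math works out to: 40 seconds of grace, then three consecutive failures at 30-second intervals (roughly another 90 seconds) before Docker flags the container as unhealthy.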

Debugging Unhealthy Containers

When a container shows as "unhealthy," Docker's output is frustratingly vague. Here's how I actually debug it:

1. Check the Health Status Details

docker inspect --format='{{json .State.Health}}' container_name | jq

This shows the last few health check results and their exit codes. If the command is failing, you'll see the error here.
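
When the log gets noisy, I narrow it to the most recent attempt; each entry in the Log array carries the check's timestamps, exit code, and captured output:

docker inspect --format='{{json .State.Health}}' container_name | jq '.Log[-1]'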

2. Run the Health Check Command Manually

docker exec -it container_name sh
wget --spider -q http://localhost:5678/healthz
echo $?

If the exit code is non-zero, the health check is legitimately failing. If it returns 0, the issue is with the health check configuration, not the service.
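
Keep in mind that Docker only treats exit code 0 as healthy and 1 as unhealthy (2 is reserved). If a command can return other codes, the CMD-SHELL form lets you normalize them:

test: ["CMD-SHELL", "wget --spider -q http://localhost:5678/healthz || exit 1"]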

3. Common Failure Reasons I've Hit

  • Missing command: The health check tool (wget, curl, pg_isready) isn't installed
  • Wrong port: The service binds to a different port than the health check expects
  • Startup delay: The service isn't ready within start_period
  • Network isolation: The health check hits localhost, but the service binds only to a specific interface IP rather than 127.0.0.1 or 0.0.0.0 (see the netstat check below)
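
For the last two, I verify what the process is actually listening on from inside the container. Most BusyBox-based images include netstat, though that's worth confirming per image:

docker exec container_name netstat -tln

If the LISTEN lines show only a specific interface address instead of 127.0.0.1 or 0.0.0.0, a localhost health check can never succeed.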

Health Checks for PostgreSQL

For PostgreSQL, I use pg_isready instead of a TCP connection check:

healthcheck:
  test: ["CMD", "pg_isready", "-U", "postgres"]
  interval: 10s
  timeout: 3s
  retries: 5

This actually verifies the database is accepting connections, not just that the port is open. I tried using nc -z localhost 5432 at first, but that passed even when PostgreSQL was in recovery mode after a crash.
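
pg_isready catches that case because of its exit codes: 0 means the server is accepting connections, 1 means it is rejecting them (which is what you get during startup or crash recovery), 2 means no response, and 3 means the check itself was misconfigured. Easy to verify by hand:

docker exec container_name pg_isready -U postgres
echo $?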

When I Don't Use Health Checks

I don't add health checks to:

  • Nginx/Traefik: These fail fast and loudly if misconfigured
  • Redis: My use case doesn't require strict readiness checks
  • One-off containers: If it's not part of a dependency chain, I skip it

Health checks add complexity. If a service doesn't have downstream dependencies, the default Docker behavior is fine.

Handling Rollbacks When Health Checks Fail

Health checks alone don't roll back deployments; they just mark containers as unhealthy. To actually roll back, I wrote a simple script that runs after docker-compose up -d:

#!/bin/bash

SERVICE="n8n"   # compose service name (matches the container name in my setup)
MAX_WAIT=60     # seconds to wait for a healthy status

for i in $(seq 1 "$MAX_WAIT"); do
  # docker inspect needs the container name; empty if the container doesn't exist yet
  STATUS=$(docker inspect --format='{{.State.Health.Status}}' "$SERVICE" 2>/dev/null)

  if [ "$STATUS" = "healthy" ]; then
    echo "Service is healthy"
    exit 0
  fi

  sleep 1
done

echo "Service failed health check. Rolling back..."
docker-compose down
docker-compose up -d --no-deps "$SERVICE"
exit 1

This is basic, and it restarts the same image rather than deploying an older version, but it works for my setup. In production, I'd use Watchtower or a CI/CD pipeline with proper versioning.
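
A minimal sketch of what that versioning could look like, assuming the image tag lives in an .env file (the tag numbers here are placeholders, not pinned recommendations):

# .env
N8N_TAG=1.64.0

# docker-compose.yml
services:
  n8n:
    image: n8nio/n8n:${N8N_TAG}

# Rollback: point the tag at the previous release and redeploy just that service
sed -i 's/^N8N_TAG=.*/N8N_TAG=1.63.0/' .env
docker-compose up -d n8n

Compose substitutes ${N8N_TAG} from the .env file in the project directory, so rolling back becomes a one-line edit instead of a full docker-compose down.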

Key Takeaways

  • Health checks prevent dependency race conditions, not just monitoring
  • Use wget or verify the health check tool exists in the container
  • Set start_period high enough for slow-starting services
  • Debug with docker inspect and manual command execution
  • Don't add health checks unless they serve a purpose

For my self-hosted stack, health checks have eliminated most startup failures. The initial setup took some trial and error, but now my services start reliably, even after a full Proxmox reboot.