Creating Docker Healthchecks That Actually Work: Testing Beyond Port Availability with Custom Scripts

Why I Started Looking Beyond Port Checks

I run a mix of self-hosted services on Proxmox—mostly in Docker containers managed through Portainer. For a long time, I relied on basic healthchecks that just pinged a port or ran curl localhost:8080. That worked fine until it didn't.

One morning, my n8n instance showed as "healthy" in Docker, but workflows weren't executing. The container was running, the port was open, but the internal worker process had crashed. Docker had no idea because my healthcheck only verified the HTTP server was listening.

That's when I realized: checking if a port is open tells you almost nothing about whether your application actually works.

What I Actually Check Now

I've rebuilt my healthchecks around one principle: test what matters for the specific service. For a web app, that means checking if it can serve requests. For a database, it means verifying it accepts queries. For n8n, it means confirming the workflow engine is alive.

Here's what I check in my actual setup:

n8n Workflow Automation

My n8n container runs workflows that handle monitoring alerts and data processing. A dead worker means everything stops, even if the web UI loads.

I wrote a small healthcheck script that hits an internal endpoint and verifies the response contains expected data:

#!/bin/sh
# Hit n8n's health endpoint; -f makes curl exit non-zero on HTTP errors,
# -s keeps it quiet so only the response body is captured.
response=$(curl -sf http://localhost:5678/healthz)

# Healthy only if the body contains the expected marker, not just any 200
if echo "$response" | grep -q "ok"; then
  exit 0
else
  exit 1
fi

In my Dockerfile:

HEALTHCHECK --interval=30s --timeout=5s --start-period=40s --retries=3 \
  CMD /app/healthcheck.sh

The --start-period=40s is critical. n8n takes time to initialize its database connection and load workflows. Without this grace period, Docker would mark it unhealthy during startup and restart it in a loop.

PostgreSQL Database

For my PostgreSQL container, I don't just check if the port is open. I run an actual query to confirm the database can process requests:

HEALTHCHECK --interval=10s --timeout=5s --start-period=30s \
  CMD pg_isready -U myuser -d mydb && psql -U myuser -d mydb -c "SELECT 1" > /dev/null || exit 1

The pg_isready check is fast, but the SELECT 1 query proves the database can actually execute SQL. I've had cases where PostgreSQL accepted connections but couldn't run queries due to disk issues.
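
When that one-liner gets unwieldy, the same logic fits in a small script. This is a sketch using the same myuser/mydb placeholders as above:

#!/bin/sh
# Fast check: is PostgreSQL accepting connections at all?
pg_isready -q -U myuser -d mydb || exit 1

# Slower but meaningful check: can it actually execute SQL?
psql -U myuser -d mydb -tAc "SELECT 1" > /dev/null || exit 1

exit 0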

Custom Python Services

I run a few custom Python scripts as long-running containers—mostly for scraping data and feeding it into processing pipelines. These don't have HTTP endpoints, so I check if the main process is still running and hasn't exited with an error.

Healthcheck script:

#!/bin/sh
# Healthy as long as the main Python process is still running;
# pgrep -f matches against the full command line, not just the process name.
if pgrep -f "python /app/main.py" > /dev/null; then
  exit 0
else
  exit 1
fi

This isn't perfect—it only confirms the process exists, not that it's functioning correctly. But for simple scripts, it catches the most common failure: the process crashing.
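
One idea I've sketched but not rolled out yet: have the script touch a heartbeat file on every loop iteration, and let the healthcheck fail when that file goes stale. This assumes main.py is modified to touch /tmp/heartbeat, so treat it as a sketch rather than what currently runs in my setup:

#!/bin/sh
# Assumes main.py touches /tmp/heartbeat after each successful iteration.
# Unhealthy if the file is missing or older than 5 minutes.
if [ -z "$(find /tmp/heartbeat -mmin -5 2>/dev/null)" ]; then
  exit 1
fi
exit 0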

What Didn't Work

I made several mistakes before landing on checks that actually helped:

Checking Too Frequently

I initially set healthchecks to run every 10 seconds across all containers. This created unnecessary load, especially on services that do heavy initialization. I saw PostgreSQL spike CPU during healthchecks because I was running queries too often while it was under load.

Now I use longer intervals (30s for most services) unless the container is critical and needs faster detection.

Not Accounting for Startup Time

My first n8n healthcheck had no --start-period. Docker marked it unhealthy immediately because it took 30+ seconds to start. The container would restart, fail again, and loop endlessly.

I now set --start-period to at least double the observed startup time. Better to wait a bit longer than trigger false restarts.
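
To get a number for the observed startup time, I compare when the container started with when its first check passed. Docker keeps recent check results, including timestamps, in .State.Health.Log:

docker inspect --format='{{.State.StartedAt}}' container_name
docker inspect --format='{{json .State.Health.Log}}' container_name

The gap between StartedAt and the first log entry with ExitCode 0 is roughly how long the service needs before it can pass a check.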

Using Generic Checks for Everything

I tried using curl -f http://localhost:PORT for every web service. This failed for applications that return HTTP 200 even when broken. For example, my Synology NAS runs a reverse proxy that returns 200 for error pages. A generic curl check passed, but the actual service behind it was down.

Now I check for specific response content, not just status codes.
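
The difference in the check itself is small. The exact endpoint and the string to grep for depend on the service, so treat these as placeholders:

# Generic check: passes as long as something answers without an HTTP error
curl -sf http://localhost:8080/ > /dev/null

# Content check: passes only if the response looks like a working service
curl -sf http://localhost:8080/api/status | grep -q '"status":"ok"'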

Ignoring Dependencies

My n8n container depends on PostgreSQL. I originally only checked n8n itself, but if the database was down, n8n would start, fail to connect, and sit in a broken state while Docker thought it was healthy.

I added a dependency check in the healthcheck script:

#!/bin/sh
# Check database connection first
if ! nc -z postgres 5432; then
  exit 1
fi

# Then check n8n
response=$(curl -sf http://localhost:5678/healthz)
if echo "$response" | grep -q "ok"; then
  exit 0
else
  exit 1
fi

This isn't perfect—it assumes nc is available in the container—but it catches the most common failure mode.
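
If nc isn't in the image, there's usually something else that can open a TCP connection. The n8n image ships Node, so a probe like this could stand in for the nc line (a sketch, using the same postgres hostname and port as above):

# TCP probe to PostgreSQL using Node instead of nc; exits 1 on error or after 3 seconds
node -e "const s=require('net').connect({host:'postgres',port:5432},()=>{s.end();process.exit(0)});s.on('error',()=>process.exit(1));setTimeout(()=>process.exit(1),3000)" || exit 1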

How I Test Healthchecks

Before deploying a new healthcheck, I test it manually inside the running container:

docker exec -it container_name /bin/sh
/app/healthcheck.sh
echo $?

If the script returns 0, it's healthy. If it returns 1, it's unhealthy. I deliberately break the service (stop a process, kill the database connection) and verify the healthcheck detects it.
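
For the Python containers, for example, breaking the service is a one-liner from the host (assuming pkill exists in the image):

# Simulate the most common failure: kill the main process
docker exec container_name pkill -f "python /app/main.py"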

I also watch the container status over time:

docker inspect --format='{{.State.Health.Status}}' container_name

If it flaps between healthy and unhealthy, the check is too sensitive or the service is genuinely unstable.
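
A quick loop makes flapping obvious without staring at docker ps:

while true; do
  echo "$(date +%T) $(docker inspect --format='{{.State.Health.Status}}' container_name)"
  sleep 10
done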

Docker Compose Integration

In my docker-compose.yml files, I define healthchecks inline and use depends_on with service_healthy to enforce startup order:

version: '3.8'
services:
  postgres:
    image: postgres:14
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U myuser && psql -U myuser -d mydb -c 'SELECT 1' > /dev/null"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 30s

  n8n:
    image: n8nio/n8n
    depends_on:
      postgres:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "/app/healthcheck.sh"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 40s

This ensures n8n only starts after PostgreSQL is confirmed healthy. Before I added this, n8n would start too early, fail to connect, and enter a broken state.

Limitations I've Hit

Healthchecks aren't a silver bullet. Here's what they don't solve:

Slow Degradation

If a service gradually slows down but keeps responding, the healthcheck won't catch it. I've had containers that took 30 seconds to respond to requests but still passed the healthcheck because the timeout was set to 60 seconds.

I haven't solved this perfectly. For critical services, I use external monitoring (OneUptime) to track response times, not just availability.
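
One partial mitigation I've sketched but not deployed is giving the healthcheck its own latency budget with curl's --max-time, so a response slower than a couple of seconds counts as a failure:

#!/bin/sh
# Fail if n8n takes longer than 2 seconds to answer, even if it would eventually respond
response=$(curl -sf --max-time 2 http://localhost:5678/healthz) || exit 1
echo "$response" | grep -q "ok" || exit 1
exit 0

The trade-off is the usual one: set the budget too tight and heavy load turns into false restarts.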

Resource Exhaustion

A container can be "healthy" while consuming 100% CPU or running out of memory. Healthchecks don't monitor resource usage—they only check application logic.

I rely on Proxmox's built-in resource monitoring for this. If a container consistently maxes out CPU, I investigate even if it's marked healthy.
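
A quick spot check from the host shows whether a "healthy" container is quietly pegging a core:

docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}"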

False Positives During High Load

Under heavy load, healthchecks can time out even when the service is functioning. I saw this with my PostgreSQL container during a large data import—the healthcheck query timed out, Docker marked it unhealthy, and restarted it mid-import.

I increased the timeout to 10 seconds and added more retries. This reduced false positives but made detection slower.
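
The adjusted PostgreSQL healthcheck section looks roughly like this (the retries value is illustrative; tune it to how quickly you need failures detected):

healthcheck:
  test: ["CMD-SHELL", "pg_isready -U myuser && psql -U myuser -d mydb -c 'SELECT 1' > /dev/null"]
  interval: 10s
  timeout: 10s
  retries: 5
  start_period: 30s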

Key Takeaways

After months of tuning healthchecks across my self-hosted stack, here's what actually matters:

  • Test the real functionality of your service, not just port availability
  • Set --start-period to at least double your observed startup time
  • Use longer intervals (30s+) unless you need fast failure detection
  • Check dependencies before checking the service itself
  • Test healthchecks manually before deploying them
  • Accept that healthchecks won't catch everything—use external monitoring for critical services

Healthchecks are a tool, not a guarantee. They catch obvious failures—crashed processes, broken database connections, unresponsive HTTP servers. They don't catch subtle issues like slow performance, memory leaks, or logical errors in your application.

For my setup, that's enough. I catch most problems early, Docker auto-restarts broken containers, and I spend less time manually checking if services are alive. The rest I handle with monitoring and logs.