Why I Started Using Docker Health Checks
I run a handful of services in Docker that depend on external APIs — weather data feeds, RSS aggregators, and a few automation workflows that pull from third-party endpoints. For a long time, I relied on Docker’s default behavior: if the container process is running, Docker assumes everything is fine.
That worked until it didn’t. I’d find containers technically “up” but completely useless because an upstream API was down, a network route changed, or the service inside the container crashed in a way that left the process alive but unresponsive. The container stayed running, but my workflows silently failed.
I needed a way to detect these failures and restart containers automatically when they couldn’t reach their dependencies. That’s when I started using Docker Compose health checks with automatic recreation.
My Setup and What I Actually Use
Most of my stack runs on Proxmox VMs with Docker Compose managing individual services. I have containers that:
- Scrape and process data from external APIs
- Run n8n workflows that depend on webhook endpoints
- Monitor RSS feeds and news sources
- Interact with local services like PostgreSQL and Redis
These aren’t mission-critical production systems, but they need to stay functional without constant babysitting. When an upstream API goes down or returns errors, I want the container to recognize the failure and restart — not sit there pretending everything is fine.
How I Implement Health Checks in Docker Compose
Docker Compose lets you define health checks directly in the compose file. Here’s a real example from one of my RSS processing services:
```yaml
services:
  rss-processor:
    image: my-rss-processor:latest
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 60s
      timeout: 10s
      retries: 3
      start_period: 30s
```
The parameters I use:
- `test`: The command that checks if the service is healthy. I use curl for HTTP endpoints.
- `interval`: How often to run the check. I use 60 seconds for most services to avoid unnecessary load.
- `timeout`: Maximum time to wait for a response. 10 seconds works for most external APIs.
- `retries`: How many consecutive failures before marking the container unhealthy. I use 3 to avoid false positives from temporary network glitches.
- `start_period`: Grace period after container startup during which failures don't count. Some services take time to initialize.
This setup marks the container unhealthy if the health endpoint fails three times in a row (at a 60-second interval, that's roughly three minutes of failures before the status flips). One caveat worth knowing: `restart: unless-stopped` by itself only restarts a container whose main process exits. Outside Docker Swarm, Docker won't recreate a container just because it's unhealthy, so the health status needs a watcher to turn it into an actual restart.
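A lightweight way to close that gap is a sidecar container that watches Docker health statuses and restarts anything marked unhealthy. A minimal sketch using the willfarrell/autoheal image (the label scope and naming here are illustrative, not my exact config):

```yaml
services:
  autoheal:
    image: willfarrell/autoheal
    restart: always
    environment:
      # Watch every container; set a specific label name to opt in per-container instead
      - AUTOHEAL_CONTAINER_LABEL=all
    volumes:
      # autoheal needs the Docker socket to inspect and restart containers
      - /var/run/docker.sock:/var/run/docker.sock
```

Mounting the Docker socket grants the sidecar full control over the Docker daemon, so this is a trade-off worth making consciously on a homelab and avoiding on anything exposed.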
Checking External API Dependencies
For containers that depend on external APIs, I don’t just check if my service is running — I check if it can actually reach the upstream dependency. Here’s an example for a service that pulls weather data:
```yaml
healthcheck:
  test: ["CMD", "curl", "-f", "https://api.weather.example.com/status"]
  interval: 120s
  timeout: 15s
  retries: 2
  start_period: 20s
```
This directly tests the upstream API. If the API is unreachable or returns errors, the health check fails and the container is flagged unhealthy, which triggers a restart. Sometimes a fresh start is enough to recover from transient network issues or stale connections.
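When a service needs both its own endpoint and the upstream to be working, the two checks can be combined in one test so either failure flips the status. A sketch using `CMD-SHELL` (which requires a shell and curl inside the image; the URLs are the same illustrative ones as above):

```yaml
healthcheck:
  # Fails if the local service is down OR the upstream API is unreachable
  test: ["CMD-SHELL", "curl -fsS http://localhost:3000/health && curl -fsS https://api.weather.example.com/status"]
  interval: 120s
  timeout: 15s
  retries: 2
```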
The Problem with Missing Dependencies in Health Checks
One mistake I made early on was assuming curl was available in every container. It’s not. Many lightweight images (especially Alpine-based ones) don’t include it by default.
When I first deployed a health check that used curl, the container was marked unhealthy immediately. The logs showed nothing useful — just “unhealthy” status with no clear reason. I had to manually inspect the container to figure out what was happening:
```shell
docker exec -it my-container sh
# inside the container:
curl -f http://localhost:3000/health
# sh: curl: not found
```
The health check was failing because the command itself couldn’t run. Docker doesn’t distinguish between “the service is broken” and “the health check command is broken” — both result in an unhealthy container.
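There is a faster way to see this than exec'ing into the container: `docker inspect` exposes the health state, including the exit code and captured output of the last few check attempts, which in this case would have surfaced the `curl: not found` error directly.

```shell
# Dump the health state: status, failing streak, and a log of recent check results
docker inspect --format '{{json .State.Health}}' my-container
```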
How I Fixed It
For images I control, I install curl in the Dockerfile:
```dockerfile
RUN apk add --no-cache curl
```
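That line is for Alpine; on Debian- or Ubuntu-based images the equivalent would be something like:

```dockerfile
# Debian/Ubuntu variant; clean the apt cache to keep the image small
RUN apt-get update \
 && apt-get install -y --no-install-recommends curl \
 && rm -rf /var/lib/apt/lists/*
```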
For images I don’t control, I use alternatives that are more likely to be present. For example, wget is often available:
```yaml
test: ["CMD", "wget", "--spider", "-q", "http://localhost:3000/health"]
```
Or I use basic TCP checks with nc (netcat), which is included in many minimal images:
```yaml
test: ["CMD", "nc", "-z", "localhost", "3000"]
```
This doesn’t test the HTTP endpoint itself, but it confirms the port is open and listening. For some use cases, that’s enough.
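If the image at least has a shell, another option is a `CMD-SHELL` test that uses whichever HTTP tool happens to be present. A sketch, assuming at least one of curl or wget exists in the image:

```yaml
healthcheck:
  # Succeeds if either tool can fetch the endpoint;
  # fails if both are missing or the endpoint is down
  test: ["CMD-SHELL", "curl -fsS http://localhost:3000/health || wget -q --spider http://localhost:3000/health"]
```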
What Worked and What Didn’t
What worked:
- Health checks that test actual dependencies, not just local endpoints. If my service depends on an external API, checking that API directly catches more real failures.
- Using longer intervals (60-120 seconds) to avoid hammering containers with constant checks. Shorter intervals add unnecessary overhead.
- Setting `retries: 3` to tolerate brief network hiccups without triggering a restart.
- Combining health checks with `restart: unless-stopped` for automatic recovery.
What didn’t work:
- Assuming health check tools are always available. I wasted time debugging “unhealthy” containers that were actually fine — the health check command was just missing.
- Using very short intervals (10-20 seconds). This created noise in logs and added measurable CPU load on low-resource VMs.
- Relying only on local endpoint checks. A container can respond to `http://localhost:3000/health` while being completely unable to reach its external dependencies.
- Not setting `start_period`. Some services take 20-30 seconds to initialize, and health checks during that time produce false failures.
When Automatic Restarts Aren’t Enough
Health checks with automatic restarts solve many problems, but they’re not a universal fix. If an upstream API is down for hours, restarting the container won’t help — it will just keep failing and restarting in a loop.
For those cases, I use monitoring and alerting outside Docker. I run a simple monitoring script (via Cronicle) that checks service availability and sends notifications when something stays down for more than a few restart cycles. That tells me when I need to intervene manually or disable a service temporarily.
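The script itself can stay simple. A hypothetical sketch of the kind of check Cronicle could run on a schedule; the URL, counter file, and threshold here are placeholders, not my exact setup:

```shell
#!/bin/sh
# Hypothetical availability check, run every few minutes by a scheduler.
URL="http://localhost:3000/health"   # placeholder endpoint
STATE="/tmp/healthcheck.fails"       # consecutive-failure counter, survives between runs
THRESHOLD=3                          # alert after this many consecutive failures

if curl -fsS --max-time 10 "$URL" >/dev/null 2>&1; then
  echo 0 > "$STATE"                  # healthy: reset the counter
else
  fails=$(( $(cat "$STATE" 2>/dev/null || echo 0) + 1 ))
  echo "$fails" > "$STATE"
  if [ "$fails" -ge "$THRESHOLD" ]; then
    # Replace with a real notification (ntfy, email, webhook, ...)
    echo "still down after $fails checks: $URL"
  fi
fi
```

Because the counter only alerts after several consecutive failures, a single restart cycle stays quiet, while anything stuck in a loop eventually pages me.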
I also log health check failures to a persistent volume so I can review patterns later. Sometimes a service is flaky in ways that only show up over days or weeks, and logs help identify those trends.
Key Takeaways
- Health checks are useful when containers depend on external services or APIs that can fail independently of the container itself.
- Check actual dependencies, not just local endpoints. If your service relies on an external API, test that API directly.
- Make sure the health check command exists in the container. Missing tools like curl cause misleading “unhealthy” statuses.
- Use reasonable intervals (60+ seconds) to avoid unnecessary load. Containers don’t need to be checked every 10 seconds.
- Set retries to tolerate transient failures. Networks glitch, APIs hiccup — don’t restart on the first failure.
- Automatic restarts won’t fix persistent upstream failures. Use monitoring to detect when something is stuck in a restart loop.
Health checks aren’t magic, but they’ve made my self-hosted services more resilient to the kinds of failures that used to require manual intervention. They work best when combined with realistic expectations and external monitoring for cases where restarts don’t help.