Why I Built This
I run about 15 Docker containers across my Proxmox cluster and Synology NAS. Most of them just work, but some have this annoying habit of silently dying. A backup script stops running. A monitoring container crashes. An API service hangs without logging anything useful.
I'd find out days later when I actually needed the service. That's embarrassing when it's your own infrastructure.
I looked at Healthchecks.io first—it's well-designed and handles cron monitoring properly. But I didn't want to send pings to an external service for internal infrastructure checks. I also didn't want to self-host their full Django application just for basic monitoring.
What I needed was simple: check if containers are responding, get notified immediately if they're not, and restart them automatically. All running locally.
My Actual Setup
Here's what I'm working with:
- Proxmox VE 8.x running multiple LXC containers and VMs
- Docker hosts spread across different machines
- ntfy running in a container for push notifications (I already use this for other alerts)
- Standard cron on the Docker hosts
- Shell scripts with curl for health checks
I chose ntfy because I already had it running, and it sends notifications to my phone without requiring API keys or external services. It's just HTTP POST requests to a local endpoint.
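Publishing a notification is a single curl call; for example (this uses the same local topic the monitoring scripts later in this post post to):
curl -d "Test notification from the Docker host" http://ntfy.local:8080/monitoring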
How I Check Container Health
Most health check tutorials assume your service has a built-in health endpoint. Most of mine don't. So I check three things:
1. Is the container running?
#!/bin/bash
CONTAINER="n8n"
if ! docker ps --format '{{.Names}}' | grep -q "^${CONTAINER}$"; then
    echo "Container ${CONTAINER} is not running"
    exit 1
fi
This catches containers that have stopped or crashed completely.
2. Is it responding on its port?
#!/bin/bash
SERVICE_URL="http://localhost:5678"
if ! curl -f -s -m 10 "${SERVICE_URL}" > /dev/null; then
    echo "Service at ${SERVICE_URL} is not responding"
    exit 1
fi
The -f flag makes curl fail on HTTP errors. The -m 10 sets a 10-second timeout. I learned to add the timeout after a hung container made the script wait indefinitely.
3. Can it actually do something?
For services with APIs, I do a simple operation:
#!/bin/bash
API_RESPONSE=$(curl -s -m 10 "http://localhost:8080/api/health")
if [[ "${API_RESPONSE}" != *"ok"* ]]; then
    echo "API health check failed"
    exit 1
fi
This caught cases where the container was running and the port was open, but the application inside had deadlocked.
Sending Notifications Through ntfy
When a check fails, I send a notification immediately:
#!/bin/bash
NTFY_URL="http://ntfy.local:8080/monitoring"
CONTAINER="n8n"

send_notification() {
    local title="$1"
    local message="$2"
    local priority="${3:-default}"
    curl -H "Title: ${title}" \
        -H "Priority: ${priority}" \
        -d "${message}" \
        "${NTFY_URL}"
}

# Run health checks here...
if [ $? -ne 0 ]; then
    send_notification \
        "Container Failed: ${CONTAINER}" \
        "Health check failed at $(date)" \
        "urgent"
    exit 1
fi
The priority field makes my phone actually alert me instead of just logging the notification silently. I use "urgent" for failures and "default" for recovery notifications.
Automatic Container Restarts
Here's where I had to make a decision: restart immediately or wait?
I tried immediate restarts first. Bad idea. Some containers fail during deployment or updates, and the script would restart them mid-update, corrupting data.
My current approach: check twice, five minutes apart, then restart.
#!/bin/bash
CONTAINER="n8n"
NTFY_URL="http://ntfy.local:8080/monitoring"
FAILURE_FLAG="/tmp/healthcheck_${CONTAINER}_failed"

# send_notification() is the same helper shown in the ntfy section above

run_health_check() {
    # All the health check logic from above: container running, port responding.
    # Returns 0 on success, 1 on failure.
    docker ps --format '{{.Names}}' | grep -q "^${CONTAINER}$" || return 1
    curl -f -s -m 10 "http://localhost:5678" > /dev/null || return 1
    return 0
}

if ! run_health_check; then
    if [ -f "${FAILURE_FLAG}" ]; then
        # Second consecutive failure - restart
        send_notification \
            "Restarting ${CONTAINER}" \
            "Two consecutive failures detected" \
            "urgent"
        docker restart "${CONTAINER}"
        sleep 30
        if run_health_check; then
            send_notification \
                "${CONTAINER} Recovered" \
                "Container restarted successfully" \
                "default"
            rm -f "${FAILURE_FLAG}"
        else
            send_notification \
                "${CONTAINER} Restart Failed" \
                "Container still unhealthy after restart" \
                "urgent"
        fi
    else
        # First failure - just flag it
        touch "${FAILURE_FLAG}"
        send_notification \
            "${CONTAINER} Health Check Failed" \
            "Will restart if next check fails" \
            "high"
    fi
else
    # Health check passed - clear any failure flags
    rm -f "${FAILURE_FLAG}"
fi
The flag file approach is crude but works reliably. I considered using a proper state file with timestamps, but this is simpler and has never failed me.
Setting Up the Cron Jobs
I run health checks every 5 minutes for critical services, every 15 minutes for everything else:
# Critical services - every 5 minutes
*/5 * * * * /opt/scripts/healthcheck_n8n.sh >> /var/log/healthchecks/n8n.log 2>&1
# Standard services - every 15 minutes
*/15 * * * * /opt/scripts/healthcheck_syncthing.sh >> /var/log/healthchecks/syncthing.log 2>&1
*/15 * * * * /opt/scripts/healthcheck_homeassistant.sh >> /var/log/healthchecks/homeassistant.log 2>&1
I log everything because when something breaks at 3 AM, I want to see the full timeline. The logs showed me that one container was failing health checks for exactly 2 minutes every night at midnight—turned out to be an internal maintenance job that locked the database.
What Didn't Work
Docker's built-in health checks
Docker has HEALTHCHECK instructions in Dockerfiles. I tried using those with docker inspect to check container health. The problem: most of my containers don't have health checks defined, and adding them means rebuilding images or maintaining custom Dockerfiles.
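For reference, a status check via docker inspect looks something like this; it only returns anything useful when the image actually defines a HEALTHCHECK:
#!/bin/bash
CONTAINER="n8n"
# Prints "healthy", "unhealthy", or "starting" - errors out if no HEALTHCHECK is defined
STATUS=$(docker inspect --format '{{.State.Health.Status}}' "${CONTAINER}" 2>/dev/null)
if [ "${STATUS}" != "healthy" ]; then
    echo "Container ${CONTAINER} health status: ${STATUS:-no health check defined}"
    exit 1
fi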
External health checks work with any container, regardless of how it was built.
Monitoring container stats
I tried checking CPU and memory usage as health indicators. A container using 0% CPU must be dead, right? Wrong. Some of my services are event-driven and sit idle most of the time. Others spike to 100% CPU during normal operation.
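For illustration, a stats-based check looks something like this (a one-shot sample, not something I kept):
#!/bin/bash
CONTAINER="n8n"
# Single snapshot of CPU and memory usage, e.g. "CPU: 0.03%  MEM: 1.25%"
docker stats --no-stream --format 'CPU: {{.CPUPerc}}  MEM: {{.MemPerc}}' "${CONTAINER}"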
Resource monitoring is useful for capacity planning, not health checks.
Single-check restarts
As mentioned earlier, restarting on the first failure caused problems during updates and created restart loops when a service had a persistent configuration issue.
Checking too frequently
I initially ran checks every minute. This generated too many notifications during expected downtime (like when I'm deliberately restarting something) and put unnecessary load on the services being checked.
Five minutes is frequent enough to catch real problems quickly but slow enough to avoid false positives.
Real Problems This Caught
Within the first week, the health checks caught:
- My n8n container dying every few days due to a memory leak (fixed by adding a memory limit and automatic restart policy; a rough sketch follows this list)
- A DNS container that stopped responding but didn't crash (never figured out why, but the script now restarts it automatically when it happens)
- A backup script that silently failed when the destination disk filled up
- Network issues between containers that made services unreachable even though they were running
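For the n8n memory leak, applying a memory limit and a restart policy to an already-running container looks something like this (the 512m value is just an example, not a recommendation):
# Cap memory so the leak triggers an OOM kill instead of starving the host
docker update --memory 512m --memory-swap 512m n8n
# Let Docker bring the container back up if it gets killed or crashes
docker update --restart unless-stopped n8n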
The notification about the DNS container came through while I was traveling. I was able to SSH in and check the logs before users (my family) noticed anything was wrong.
Current Limitations
This setup doesn't handle:
- Containers that need specific startup sequences or dependencies
- Stateful services where restart order matters
- Problems that require more than a simple restart (like full data corruption)
- Cascading failures where one service depends on another
For those cases, I still get the notification, but I have to intervene manually.
I also don't check services running on my Synology NAS the same way because I don't have direct shell access configured there. Those still need manual monitoring.
Key Takeaways
Simple health checks with cron, curl, and ntfy work well for small self-hosted setups. You don't need a complex monitoring platform to catch most container failures.
Checking twice before restarting prevents false positives and avoids making problems worse during updates.
Logging everything is worth the disk space. When something breaks, having the timeline matters more than saving a few megabytes.
External health checks are more flexible than Docker's built-in HEALTHCHECK because they work with any container and can test actual functionality, not just process liveness.
The notification part is as important as the detection. Getting alerted immediately means I can fix things before they become bigger problems or before users notice.