Why I Built This
I run Docker Compose stacks across multiple Proxmox LXC containers and VMs. When something breaks—a container crashes, a volume fills up, or a service stops responding—I need to know immediately. Checking each host manually wastes time and doesn't scale.
I already had Grafana and Loki running for general monitoring, but I needed something specific: a script that could check Docker Compose stack health on each host and push structured logs to Loki. Not metrics. Not generic container stats. Just clear health checks with context I could query later.
My Real Setup
I have four hosts running Docker Compose stacks:
- One control host (Proxmox LXC) running Grafana and Loki
- Three remote hosts (mix of LXCs and VMs) running various application stacks
Each remote host needed to run a health check script on a schedule and send results to the central Loki instance. I didn't want to install Promtail on every host just for this—too much overhead for what I needed.
The script had to:
- Check if Docker Compose stacks are running
- Identify unhealthy or stopped containers
- Send structured logs directly to Loki's HTTP API
- Run via cron without manual intervention
The Script I Actually Use
Here's the bash script I wrote. It's not elegant, but it works reliably:
#!/bin/bash
# check-docker-health.sh: check Docker Compose stacks and push results to Loki

LOKI_URL="http://<control-host-ip>:3100/loki/api/v1/push"
HOSTNAME=$(hostname)
TIMESTAMP=$(date +%s%N)   # nanoseconds since the Unix epoch, captured once per run

check_stack_health() {
    local stack_dir=$1
    local stack_name=$(basename "$stack_dir")

    cd "$stack_dir" || return 1

    if [ ! -f "docker-compose.yml" ]; then
        send_log "error" "No docker-compose.yml found in $stack_dir"
        return 1
    fi

    # Container IDs are safer to iterate over than parsed table output.
    local containers=$(docker-compose ps -q)
    if [ -z "$containers" ]; then
        send_log "warning" "No containers found for stack: $stack_name"
        return 0
    fi

    local unhealthy=0
    local stopped=0
    for container in $containers; do
        local status=$(docker inspect --format='{{.State.Status}}' "$container")
        # Containers without a HEALTHCHECK report "<no value>" here.
        local health=$(docker inspect --format='{{.State.Health.Status}}' "$container" 2>/dev/null)
        local name=$(docker inspect --format='{{.Name}}' "$container" | sed 's/\///')

        if [ "$status" != "running" ]; then
            send_log "error" "Container $name in stack $stack_name is $status"
            ((stopped++))
        elif [ "$health" = "unhealthy" ]; then
            send_log "warning" "Container $name in stack $stack_name is unhealthy"
            ((unhealthy++))
        fi
    done

    if [ $unhealthy -eq 0 ] && [ $stopped -eq 0 ]; then
        send_log "info" "Stack $stack_name is healthy"
    fi
}

send_log() {
    local level=$1
    local message=$2
    # Loki push API payload: one stream labelled by job/host/level,
    # with the timestamp sent as a string of nanoseconds.
    local json_payload=$(cat <<EOF
{
  "streams": [
    {
      "stream": {
        "job": "docker-health-check",
        "host": "$HOSTNAME",
        "level": "$level"
      },
      "values": [
        ["$TIMESTAMP", "$message"]
      ]
    }
  ]
}
EOF
)
    curl -s -X POST "$LOKI_URL" \
        -H "Content-Type: application/json" \
        -d "$json_payload" > /dev/null
}

# Main execution
STACK_DIRS=(
    "/opt/stacks/app1"
    "/opt/stacks/app2"
    "/opt/stacks/app3"
)

for stack_dir in "${STACK_DIRS[@]}"; do
    check_stack_health "$stack_dir"
done
I saved this as /usr/local/bin/check-docker-health.sh on each remote host and made it executable.
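If you're replicating this, getting it onto a host is just a copy and a chmod (the source filename here is whatever you saved locally; the destination matches the path above):

# on each remote host: put the script in place and make it executable
cp check-docker-health.sh /usr/local/bin/check-docker-health.sh
chmod +x /usr/local/bin/check-docker-health.sh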
How I Schedule It
I added a cron job to run the script every 5 minutes:
*/5 * * * * /usr/local/bin/check-docker-health.sh
This runs silently. All output goes to Loki, not to email or local logs.
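If you'd rather not open an editor on every host, appending to the existing crontab works the same way (run it as a user that can talk to the Docker daemon):

# append the health check entry non-interactively
( crontab -l 2>/dev/null ; echo '*/5 * * * * /usr/local/bin/check-docker-health.sh' ) | crontab -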
What Worked
The script correctly identifies stopped and unhealthy containers. I can query Loki in Grafana using LogQL:
{job="docker-health-check", level="error"}
This shows me all failures across all hosts in one view. I can filter by host, stack name, or time range.
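For example, narrowing to one host and one stack is a label match plus a line filter. The host label value below is a placeholder; the stack name is matched in the message text, since it isn't a label:

{job="docker-health-check", host="docker-host-1"} |= "app1"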
The structured logging format makes it easy to build alerts. I set up a Grafana alert that triggers when any host reports an error-level log. It sends a notification to my Discord webhook.
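The exact rule depends on your Grafana setup, but the metric query behind it is along these lines: count error-level lines per host over the last five minutes (one cron interval) and fire when the count is above zero.

sum by (host) (count_over_time({job="docker-health-check", level="error"}[5m])) > 0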
Using Loki's HTTP API directly was the right choice. No extra agents, no file tailing, no log rotation issues. The script just pushes JSON and moves on.
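A side benefit: you can verify the whole path from any host with one curl and no script at all. This is just a throwaway test line, using the same URL as the script:

# push a single test entry to confirm the host can reach Loki
curl -s -X POST "http://<control-host-ip>:3100/loki/api/v1/push" \
  -H "Content-Type: application/json" \
  -d "{\"streams\":[{\"stream\":{\"job\":\"docker-health-check\",\"host\":\"$(hostname)\",\"level\":\"info\"},\"values\":[[\"$(date +%s%N)\",\"manual test\"]]}]}"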
What Didn't Work
My first version tried to parse docker-compose ps output as text. This broke when container names had spaces or special characters. I switched to using docker-compose ps -q to get container IDs, then used docker inspect for reliable status checks.
I got timestamps wrong at first. Loki's push API expects each timestamp as a string of nanoseconds since the Unix epoch, which is why the script uses date +%s%N and quotes the value in the JSON payload. Send microseconds or milliseconds instead and Loki still treats the number as nanoseconds, so the entries land far in the past and won't appear in order (if they appear at all). This took me an hour to figure out.
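The difference is easy to see with GNU date (the numbers here are just illustrative):

date +%s       # 1700000000            seconds - wrong scale for Loki
date +%s%3N    # 1700000000123         milliseconds - still the wrong scale
date +%s%N     # 1700000000123456789   nanoseconds - what the push API expects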
Health checks only work for containers that define a HEALTHCHECK in their Dockerfile or Compose file. For everything else, docker inspect prints <no value> for the health status, which the script simply ignores. I had to add explicit health checks to several of my stacks to make this useful.
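Adding one is only a few lines in the Compose file. This is the general shape; the service name, image, and test command are placeholders, and the check only works if the probe command (curl here) actually exists inside the container:

services:
  app1:
    image: example/app1:latest          # placeholder
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]   # example HTTP probe
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 20s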
Curl failures are silent in the script. If Loki is down or unreachable, the script doesn't know. I considered adding error handling, but decided against it—if Loki is down, I have bigger problems and will notice through other monitoring.
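For what it's worth, the change would be small. A sketch of what I'd add inside send_log if I ever change my mind, falling back to a local file since Loki is by definition unreachable at that point (the log path is just an example):

# -f makes curl return non-zero on HTTP errors, so an unreachable or
# unhappy Loki can be recorded locally instead of vanishing silently
if ! curl -s -f -X POST "$LOKI_URL" \
    -H "Content-Type: application/json" \
    -d "$json_payload" > /dev/null; then
    echo "$(date -Is) failed to push $level log to Loki" >> /var/log/docker-health-push-failures.log
fi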
Key Takeaways
Direct HTTP logging to Loki works well for custom health checks. You don't need Promtail for everything.
Structured logs with consistent labels make querying and alerting straightforward. I use job, host, and level as my base labels.
Docker health checks must be explicitly defined. Relying on container state alone isn't enough for real health monitoring.
Bash scripts for monitoring should fail silently and log remotely. Local debugging output just creates noise.
This setup has been running for three months across my hosts. It's caught several issues I would have missed otherwise—mostly containers that restarted but came up in an unhealthy state.