Why I Built This
I run local package mirrors for Arch and Gentoo on my home network. These mirrors sync from upstream repositories every few hours using cron jobs. When a sync fails silently—network issues, upstream problems, a full disk—my local machines keep pulling stale packages or, worse, partial repository states that break installations.
I needed to know when syncs failed, but I didn’t want to parse cron emails or check logs manually. I already had Alertmanager running for my Prometheus setup, so I wanted something that would integrate with that existing alert pipeline.
My Setup
I have two mirror sync jobs running on a Proxmox VM:
- Arch mirror: syncs every 6 hours using rsync
- Gentoo mirror: syncs every 12 hours, also rsync-based
Both are triggered by systemd timers (I moved away from traditional cron a while back, but the monitoring approach works for either). The sync scripts are bash, nothing fancy—they call rsync with specific flags and log to local files.
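For reference, the timer side looks roughly like this sketch—unit names, paths, and the script location are illustrative, not my actual files:

```ini
# /etc/systemd/system/arch-mirror-sync.service (hypothetical names)
[Unit]
Description=Sync local Arch mirror

[Service]
Type=oneshot
ExecStart=/usr/local/bin/arch-mirror-sync.sh

# /etc/systemd/system/arch-mirror-sync.timer
[Timer]
# Every 6 hours, matching the check's schedule in Healthchecks
OnCalendar=*-*-* 00/6:00:00
# Run a missed sync after downtime instead of skipping it
Persistent=true

[Install]
WantedBy=timers.target
```

`Persistent=true` is worth having here: if the VM was down at the scheduled time, the sync still runs at boot instead of leaving the mirror stale until the next slot.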
For monitoring, I’m using a self-hosted instance of Healthchecks running in Docker on the same Proxmox host. I chose Healthchecks because it’s designed specifically for cron job monitoring and has a clean API for sending pings. The Alertmanager integration happens through Healthchecks’ webhook support.
How the Health Check Works
The pattern is simple: each sync script reports success or failure to Healthchecks via curl at the end of execution. Healthchecks expects a ping within a defined time window. If it doesn’t receive one, or if it receives a failure signal, it triggers an alert.
Here’s what I added to my Arch sync script:
```bash
#!/bin/bash
HEALTHCHECK_URL="https://hc.example.com/ping/your-uuid-here"

# Existing rsync logic
rsync -avz --delete rsync://mirror.example.org/archlinux/ /srv/mirrors/arch/

if [ $? -eq 0 ]; then
    curl -fsS --retry 3 "${HEALTHCHECK_URL}" > /dev/null
else
    curl -fsS --retry 3 "${HEALTHCHECK_URL}/fail" > /dev/null
    exit 1
fi
```
The Gentoo script follows the same pattern. I use the /fail endpoint explicitly when rsync returns a non-zero exit code. This gives me more visibility than just missing pings—I can see in the Healthchecks dashboard whether a job failed or simply didn’t run.
Configuring the Check in Healthchecks
For each mirror, I created a check in Healthchecks with these settings:
- Schedule: Cron expression matching the systemd timer (e.g., `0 */6 * * *` for Arch)
- Grace time: 30 minutes—rsync can take a while depending on what changed upstream
- Failure notification: Webhook pointing to Alertmanager
The grace time matters. I initially set it to 10 minutes and got false alerts when large package updates caused longer sync times. 30 minutes has been reliable without being so long that I miss real problems.
Alertmanager Integration
Healthchecks doesn’t have native Alertmanager support, but it has a generic webhook integration. I wrote a small Python script that runs as a systemd service and acts as a bridge. It receives webhooks from Healthchecks and converts them to Alertmanager-compatible alerts.
The bridge script listens on a local port and does this:
```python
from flask import Flask, request
import requests

app = Flask(__name__)
ALERTMANAGER_URL = "http://localhost:9093/api/v1/alerts"

@app.route('/webhook', methods=['POST'])
def webhook():
    data = request.json
    alert = [{
        "labels": {
            "alertname": "MirrorSyncFailure",
            "job": data.get("name", "unknown"),
            "severity": "warning"
        },
        "annotations": {
            "summary": f"Mirror sync failed: {data.get('name')}",
            "description": data.get("description", "No details provided")
        }
    }]
    requests.post(ALERTMANAGER_URL, json=alert)
    return "OK", 200

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8090)
```
This is minimal but functional. Healthchecks sends a POST with check details when something fails, and my script reformats it into the structure Alertmanager expects. From there, Alertmanager handles routing based on my existing configuration—I get notifications through Gotify for these alerts.
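On the Alertmanager side, the route for these alerts is ordinary configuration. A fragment like the following sketch would match them—the receiver name and bridge URL are illustrative, and since Gotify is not a native Alertmanager receiver, this assumes a small webhook-to-Gotify bridge:

```yaml
# alertmanager.yml (fragment; receiver name and URL are illustrative)
route:
  routes:
    - matchers:
        - alertname="MirrorSyncFailure"
      receiver: gotify

receivers:
  - name: gotify
    webhook_configs:
      # Assumes a local webhook-to-Gotify bridge is listening here
      - url: http://localhost:8091/alert
```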
What Didn’t Work
My first attempt used Healthchecks’ email integration to send alerts, which I planned to parse with a custom script. This was unnecessarily complex and fragile. Email parsing is always a mess, and I was essentially reinventing what Alertmanager already does well.
I also tried using Prometheus’ Blackbox Exporter to check if the mirror directories had been updated recently by looking at file modification times via HTTP. This worked technically, but it required exposing directory listings over HTTP and added complexity I didn’t need. The direct ping approach from the sync scripts is simpler and more reliable.
One mistake: I initially put the curl command before checking the rsync exit code. This meant successful pings were sent even when rsync failed. Obvious in hindsight, but I only caught it when I noticed I wasn’t getting alerts for a sync that I knew had failed.
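The underlying trap is that `$?` only holds the exit code of the most recent command, so any command between rsync and the check—including the curl ping itself—overwrites it. A safe pattern is to save the code immediately. Here's a stub sketch of the logic (`ping_url` and `false` stand in for the real curl ping and a failing rsync):

```shell
#!/bin/bash
# Stub standing in for the real curl ping; echoes which endpoint it would hit
ping_url() {
    echo "ping: $1"
}

false   # stand-in for a failing rsync
rc=$?   # capture the exit code immediately, before anything else runs

if [ "$rc" -eq 0 ]; then
    ping_url "success"
else
    ping_url "fail"
fi
```

With the real script, `rc=$?` goes on the line directly after rsync, and both curl calls test `$rc` instead of `$?`.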
Current State and Limitations
This setup has been running for about four months. I’ve caught three real sync failures—two from upstream mirror timeouts and one from a full disk issue I should have noticed earlier. No false positives since I adjusted the grace time.
The webhook bridge script is simple but not robust. It doesn’t handle Alertmanager being down, doesn’t retry failed posts, and doesn’t validate the incoming webhook payload beyond basic JSON parsing. For my use case, this is fine—if Alertmanager is down, I have bigger problems. But it’s not production-grade.
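If I ever harden the bridge, the retry part is small. A stdlib-only sketch—the function and its injectable `post` hook are mine, not part of the actual bridge:

```python
import json
import time
import urllib.request

def post_with_retry(url, payload, attempts=3, delay=1.0, post=None):
    """POST JSON to `url`, retrying on failure.

    `post` is injectable for testing; by default it performs a real
    HTTP POST with urllib. Returns True on success, False if all
    attempts failed.
    """
    if post is None:
        def post(body):
            req = urllib.request.Request(
                url, data=body,
                headers={"Content-Type": "application/json"},
            )
            urllib.request.urlopen(req, timeout=5)

    body = json.dumps(payload).encode()
    for attempt in range(attempts):
        try:
            post(body)
            return True
        except OSError:  # urllib's URLError is a subclass of OSError
            if attempt < attempts - 1:
                time.sleep(delay)
    return False
```

The webhook handler would then call `post_with_retry(ALERTMANAGER_URL, alert)` and could return a 502 to Healthchecks when it comes back False, so the failure at least shows up somewhere.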
I’m also not monitoring the health of Healthchecks itself, which is a gap. If the Healthchecks container dies, I won’t know until a sync fails and I don’t get alerted. I should probably add a meta-check for this, but I haven’t gotten around to it.
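One option I've considered for that meta-check, since Prometheus is already running: point the Blackbox Exporter at the Healthchecks web UI itself. That's a plain liveness probe, not the directory-freshness hack I abandoned earlier. A sketch, with illustrative URLs and the exporter's standard relabeling:

```yaml
# prometheus.yml (fragment; URLs are illustrative)
scrape_configs:
  - job_name: healthchecks-meta
    metrics_path: /probe
    params:
      module: [http_2xx]          # blackbox_exporter's standard HTTP module
    static_configs:
      - targets:
          - https://hc.example.com/   # the Healthchecks web UI
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9115   # where blackbox_exporter listens
```

An alert rule on `probe_success == 0` would then fire through the normal Alertmanager path if the container dies.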
Key Takeaways
Monitoring cron jobs properly requires explicit success/failure reporting, not just log parsing. The ping pattern works well for this.
Grace time configuration matters more than I expected. Set it based on actual job runtime, not what you think it should be.
Integrating with existing alert infrastructure (Alertmanager in my case) is worth the small amount of glue code. It keeps alerts in one place and reuses routing logic you’ve already configured.
The simplest approach that works is usually the right one. Direct pings from scripts beat trying to infer health from external metrics.