Why I Built This
I run local package mirrors for Arch and Gentoo on my home network. These mirrors sync from upstream repositories every few hours using cron jobs. When a sync fails silently—network issues, upstream problems, a full disk—my local machines keep pulling stale packages or, worse, partial repository states that break installations.
I needed to know when syncs failed, but I didn’t want to parse cron emails or check logs manually. I already had Alertmanager running for my Prometheus setup, so I wanted something that would integrate with that existing alert pipeline.
My Setup
I have two mirror sync jobs running on a Proxmox VM:
- Arch mirror: syncs every 6 hours using rsync
- Gentoo mirror: syncs every 12 hours, also rsync-based
Both are triggered by systemd timers (I moved away from traditional cron a while back, but the monitoring approach works for either). The sync scripts are bash, nothing fancy—they call rsync with specific flags and log to local files.
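For reference, the timer side looks roughly like this sketch—unit names, paths, and the script location are illustrative, not my actual files:

```ini
# /etc/systemd/system/arch-mirror-sync.service (hypothetical names)
[Unit]
Description=Sync local Arch mirror

[Service]
Type=oneshot
ExecStart=/usr/local/bin/arch-mirror-sync.sh

# /etc/systemd/system/arch-mirror-sync.timer
[Timer]
# Every 6 hours, matching the check's schedule in Healthchecks
OnCalendar=*-*-* 00/6:00:00
# Run a missed sync after downtime instead of skipping it
Persistent=true

[Install]
WantedBy=timers.target
```

`Persistent=true` is worth having here: if the VM was down at the scheduled time, the sync still runs at boot instead of leaving the mirror stale until the next slot.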
For monitoring, I’m using a self-hosted instance of Healthchecks running in Docker on the same Proxmox host. I chose Healthchecks because it’s designed specifically for cron job monitoring and has a clean API for sending pings. The Alertmanager integration happens through Healthchecks’ webhook support.
How the Health Check Works
The pattern is simple: each sync script reports success or failure to Healthchecks via curl at the end of execution. Healthchecks expects a ping within a defined time window. If it doesn’t receive one, or if it receives a failure signal, it triggers an alert.
Here’s what I added to my Arch sync script:
```bash
#!/bin/bash
HEALTHCHECK_URL="https://hc.example.com/ping/your-uuid-here"

# Existing rsync logic
rsync -avz --delete rsync://mirror.example.org/archlinux/ /srv/mirrors/arch/

if [ $? -eq 0 ]; then
    curl -fsS --retry 3 "${HEALTHCHECK_URL}" > /dev/null
else
    curl -fsS --retry 3 "${HEALTHCHECK_URL}/fail" > /dev/null
    exit 1
fi
```
The Gentoo script follows the same pattern. I use the /fail endpoint explicitly when rsync returns a non-zero exit code. This gives me more visibility than just missing pings—I can see in the Healthchecks dashboard whether a job failed or simply didn’t run.
Configuring the Check in Healthchecks
For each mirror, I created a check in Healthchecks with these settings:
- Schedule: Cron expression matching the systemd timer (e.g., `0 */6 * * *` for Arch)
- Grace time: 30 minutes—rsync can take a while depending on what changed upstream
- Failure notification: Webhook pointing to Alertmanager
The grace time matters. I initially set it to 10 minutes and got false alerts when large package updates caused longer sync times. 30 minutes has been reliable without being so long that I miss real problems.
Alertmanager Integration
Healthchecks doesn’t have native Alertmanager support, but it has a generic webhook integration. I wrote a small Python script that runs as a systemd service and acts as a bridge. It receives webhooks from Healthchecks and converts them to Alertmanager-compatible alerts.
The bridge script listens on a local port and does this:
```python
from flask import Flask, request
import requests

app = Flask(__name__)
ALERTMANAGER_URL = "http://localhost:9093/api/v1/alerts"

@app.route('/webhook', methods=['POST'])
def webhook():
    data = request.json
    alert = [{
        "labels": {
            "alertname": "MirrorSyncFailure",
            "job": data.get("name", "unknown"),
            "severity": "warning"
        },
        "annotations": {
            "summary": f"Mirror sync failed: {data.get('name')}",
            "description": data.get("description", "No details provided")
        }
    }]
    requests.post(ALERTMANAGER_URL, json=alert)
    return "OK", 200

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8090)
```
This is minimal but functional. Healthchecks sends a POST with check details when something fails, and my script reformats it into the structure Alertmanager expects. From there, Alertmanager handles routing based on my existing configuration—I get notifications through Gotify for these alerts.
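On the Alertmanager side, the route for these alerts is ordinary configuration. A fragment like the following sketch would match them—the receiver name and bridge URL are illustrative, and since Gotify is not a native Alertmanager receiver, this assumes a small webhook-to-Gotify bridge:

```yaml
# alertmanager.yml (fragment; receiver name and URL are illustrative)
route:
  routes:
    - matchers:
        - alertname="MirrorSyncFailure"
      receiver: gotify

receivers:
  - name: gotify
    webhook_configs:
      # Assumes a local webhook-to-Gotify bridge is listening here
      - url: http://localhost:8091/alert
```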
What Didn’t Work
My first attempt used Healthchecks’ email integration to send alerts, which I planned to parse with a custom script. This was unnecessarily complex and fragile. Email parsing is always a mess, and I was essentially reinventing what Alertmanager already does well.
I also tried using Prometheus’ Blackbox Exporter to check if the mirror directories had been updated recently by looking at file modification times via HTTP. This worked technically, but it required exposing directory listings over HTTP and added complexity I didn’t need. The direct ping approach from the sync scripts is simpler and more reliable.
One mistake: I initially put the curl command before checking the rsync exit code. This meant successful pings were sent even when rsync failed. Obvious in hindsight, but I only caught it when I noticed I wasn’t getting alerts for a sync that I knew had failed.
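The underlying trap is that `$?` only holds the exit code of the most recent command, so any command between rsync and the check—including the curl ping itself—overwrites it. A safe pattern is to save the code immediately. Here's a stub sketch of the logic (`ping_url` and `false` stand in for the real curl ping and a failing rsync):

```shell
#!/bin/bash
# Stub standing in for the real curl ping; echoes which endpoint it would hit
ping_url() {
    echo "ping: $1"
}

false   # stand-in for a failing rsync
rc=$?   # capture the exit code immediately, before anything else runs

if [ "$rc" -eq 0 ]; then
    ping_url "success"
else
    ping_url "fail"
fi
```

With the real script, `rc=$?` goes on the line directly after rsync, and both curl calls test `$rc` instead of `$?`.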
Current State and Limitations
This setup has been running for about four months. I’ve caught three real sync failures—two from upstream mirror timeouts and one from a full disk issue I should have noticed earlier. No false positives since I adjusted the grace time.
The webhook bridge script is simple but not robust. It doesn’t handle Alertmanager being down, doesn’t retry failed posts, and doesn’t validate the incoming webhook payload beyond basic JSON parsing. For my use case, this is fine—if Alertmanager is down, I have bigger problems. But it’s not production-grade.
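If I ever harden the bridge, the retry part is small. A stdlib-only sketch—the function and its injectable `post` hook are mine, not part of the actual bridge:

```python
import json
import time
import urllib.request

def post_with_retry(url, payload, attempts=3, delay=1.0, post=None):
    """POST JSON to `url`, retrying on failure.

    `post` is injectable for testing; by default it performs a real
    HTTP POST with urllib. Returns True on success, False if all
    attempts failed.
    """
    if post is None:
        def post(body):
            req = urllib.request.Request(
                url, data=body,
                headers={"Content-Type": "application/json"},
            )
            urllib.request.urlopen(req, timeout=5)

    body = json.dumps(payload).encode()
    for attempt in range(attempts):
        try:
            post(body)
            return True
        except OSError:  # urllib's URLError is a subclass of OSError
            if attempt < attempts - 1:
                time.sleep(delay)
    return False
```

The webhook handler would then call `post_with_retry(ALERTMANAGER_URL, alert)` and could return a 502 to Healthchecks when it comes back False, so the failure at least shows up somewhere.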
I’m also not monitoring the health of Healthchecks itself, which is a gap. If the Healthchecks container dies, I won’t know until a sync fails and I don’t get alerted. I should probably add a meta-check for this, but I haven’t gotten around to it.
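One option I've considered for that meta-check, since Prometheus is already running: point the Blackbox Exporter at the Healthchecks web UI itself. That's a plain liveness probe, not the directory-freshness hack I abandoned earlier. A sketch, with illustrative URLs and the exporter's standard relabeling:

```yaml
# prometheus.yml (fragment; URLs are illustrative)
scrape_configs:
  - job_name: healthchecks-meta
    metrics_path: /probe
    params:
      module: [http_2xx]          # blackbox_exporter's standard HTTP module
    static_configs:
      - targets:
          - https://hc.example.com/   # the Healthchecks web UI
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9115   # where blackbox_exporter listens
```

An alert rule on `probe_success == 0` would then fire through the normal Alertmanager path if the container dies.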
Key Takeaways
Monitoring cron jobs properly requires explicit success/failure reporting, not just log parsing. The ping pattern works well for this.
Grace time configuration matters more than I expected. Set it based on actual job runtime, not what you think it should be.
Integrating with existing alert infrastructure (Alertmanager in my case) is worth the small amount of glue code. It keeps alerts in one place and reuses routing logic you’ve already configured.
The simplest approach that works is usually the right one. Direct pings from scripts beat trying to infer health from external metrics.