
Implementing Prometheus Alerting for Silent Cron Job Failures: Detecting Missing Log Outputs and Stale Timestamps in Systemd Timer Workflows

Why I Built Prometheus Alerting for Systemd Timer Failures

I run several systemd timers on my Proxmox host and inside various VMs. They handle backups, log rotation, certificate renewals, and periodic cleanup tasks. For months, I assumed they were working because nothing visibly broke. Then I needed to restore a backup and discovered my backup timer had been failing silently for three weeks.

The timer was still enabled. Systemd showed no errors in its status. The unit just stopped producing output files. No alerts fired because I had no alerts configured. I only noticed when I needed the data that wasn’t there.

That failure pushed me to build proper monitoring. I already had Prometheus running for container metrics, so I extended it to watch for two specific failure modes: missing log outputs and stale timestamps from jobs that should run on a schedule.

The Problem with Systemd Timer Monitoring

Systemd timers fail differently than services. A service that crashes generates clear signals—systemd marks it as failed, logs show errors, and the unit enters a failed state. Timers are trickier because the timer itself can be perfectly healthy while the service it triggers silently fails or never runs.

Here’s what I’ve seen go wrong:

  • The service exits with code 0 but doesn’t actually complete its work
  • The service writes to a log file that fills the disk, causing subsequent runs to fail
  • A dependency (like a network mount) isn’t available when the timer fires
  • The timer runs but the service takes longer than expected and gets killed
  • Systemd itself has issues and the timer never fires at all

Traditional monitoring checks if a service is running. But timers aren’t supposed to be running continuously—they fire, execute, and stop. The absence of activity is normal until it isn’t.

My Monitoring Setup

I use Prometheus with node_exporter on each host. With the --collector.systemd flag enabled, node_exporter exposes systemd unit metrics, but they don’t directly tell you “this timer hasn’t produced output in X hours.” I had to build that detection myself using two approaches: timestamp checking and log output monitoring.

Approach 1: Monitoring Timestamps with Textfile Collector

The first method tracks when a job last completed successfully. Each timer service writes a timestamp to a file that node_exporter reads.

I modified my backup service to include this at the end:

#!/bin/bash
# /usr/local/bin/backup-vm-configs.sh

set -e

# Actual backup work
rsync -a /etc/pve/ /mnt/backup/pve-config/

# Write success timestamp for Prometheus. Write to a temp file and rename,
# so node_exporter never scrapes a half-written file.
PROM_FILE=/var/lib/node_exporter/textfile_collector/backup_vm_configs.prom
echo "backup_last_success_timestamp $(date +%s)" > "${PROM_FILE}.tmp"
mv "${PROM_FILE}.tmp" "$PROM_FILE"
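For context, the script is wired up with a service/timer pair roughly like the following. The unit names match the script above, but the scheduling details and the mount dependency are illustrative, not copied from my actual units:

```ini
# /etc/systemd/system/backup-vm-configs.service (sketch)
[Unit]
Description=Back up Proxmox VM configs
# Assumption: the backup target is a systemd mount unit; adjust to your setup.
Requires=mnt-backup.mount
After=mnt-backup.mount

[Service]
Type=oneshot
ExecStart=/usr/local/bin/backup-vm-configs.sh

# /etc/systemd/system/backup-vm-configs.timer (sketch)
[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target
```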

The textfile collector directory is configured in my node_exporter systemd service:

[Service]
ExecStart=/usr/local/bin/node_exporter \
  --collector.textfile.directory=/var/lib/node_exporter/textfile_collector \
  --collector.systemd

Now Prometheus scrapes that metric every 15 seconds. The metric looks like this:

backup_last_success_timestamp 1704891234

I created an alert rule that fires if the timestamp is too old:

groups:
- name: systemd_timer_alerts
  rules:
  - alert: BackupJobStale
    expr: |
      (time() - backup_last_success_timestamp) > 86400
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "VM config backup hasn't run in 24 hours"
      description: "Last successful backup was {{ $value | humanizeDuration }} ago"

This catches the case where the timer stops firing entirely or the service fails before writing the timestamp.
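Alert rules like this are easy to get subtly wrong, so it’s worth unit-testing them with promtool before relying on them. A sketch of a test file, assuming the rule above lives in systemd_timer_alerts.yml — run it with `promtool test rules backup_stale_test.yml`:

```yaml
# backup_stale_test.yml
rule_files:
  - systemd_timer_alerts.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    # Simulate a job that succeeded once at t=0 and never again:
    # the timestamp metric stays at 0 for 26 hours of samples.
    input_series:
      - series: backup_last_success_timestamp
        values: "0x1560"
    alert_rule_test:
      - eval_time: 25h
        alertname: BackupJobStale
        exp_alerts:
          - exp_labels:
              severity: warning
```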

Approach 2: Monitoring Log Output

Some of my timers don’t produce a single output file—they write to logs or update multiple files. For these, I monitor whether the log file has been modified recently.

I use a small script that runs via its own timer every 5 minutes:

#!/bin/bash
# /usr/local/bin/check-log-freshness.sh

LOG_FILE="/var/log/cert-renewal.log"
MAX_AGE_SECONDS=604800  # 7 days

if [ -f "$LOG_FILE" ]; then
  LAST_MODIFIED=$(stat -c %Y "$LOG_FILE")
  echo "cert_renewal_log_last_modified_timestamp $LAST_MODIFIED" > /var/lib/node_exporter/textfile_collector/cert_renewal_log.prom
else
  echo "cert_renewal_log_last_modified_timestamp 0" > /var/lib/node_exporter/textfile_collector/cert_renewal_log.prom
fi
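Once you watch more than one log, the script generalizes naturally into a small function. A hypothetical version — the function name and the mktemp-based demo directory are mine, not from the setup above; in production TEXTFILE_DIR would be the node_exporter textfile collector directory — that writes one metric per watched file and uses the atomic-rename pattern:

```shell
#!/bin/bash
# Hypothetical generalization: one freshness metric per watched log file.
# A temp directory stands in for /var/lib/node_exporter/textfile_collector
# so the sketch runs anywhere.
set -eu
TEXTFILE_DIR="$(mktemp -d)"

write_freshness_metric() {
  local metric="$1" log_file="$2" ts=0
  [ -f "$log_file" ] && ts=$(stat -c %Y "$log_file")
  # Write to a temp file, then rename, so node_exporter never sees a partial file.
  printf '%s %s\n' "$metric" "$ts" > "$TEXTFILE_DIR/$metric.prom.tmp"
  mv "$TEXTFILE_DIR/$metric.prom.tmp" "$TEXTFILE_DIR/$metric.prom"
}

# Demo: one log that exists, one that does not (missing logs report 0).
touch "$TEXTFILE_DIR/cert-renewal.log"
write_freshness_metric cert_renewal_log_last_modified_timestamp "$TEXTFILE_DIR/cert-renewal.log"
write_freshness_metric missing_log_last_modified_timestamp "$TEXTFILE_DIR/no-such.log"
cat "$TEXTFILE_DIR"/*.prom
```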

The corresponding alert:

- alert: CertRenewalLogStale
  expr: |
    (time() - cert_renewal_log_last_modified_timestamp) > 604800
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Certificate renewal log hasn't been updated in 7 days"

This approach has a weakness: if the service runs but fails immediately, it might still touch the log file. I only use this for jobs where I know they write meaningful output on every run.

Approach 3: Using Systemd Metrics Directly

With the --collector.systemd flag, node_exporter exposes systemd unit metrics directly, including unit states and when a timer last fired. I tried using these initially but found them less reliable than writing my own timestamps.

The relevant metrics are:

node_systemd_unit_state{name="backup-vm-configs.service",state="active"}
node_systemd_timer_last_trigger_seconds{name="backup-vm-configs.timer"}

The problem: node_systemd_timer_last_trigger_seconds tells you when the timer fired, not whether the service succeeded. A service can fail immediately and the timer still shows a recent trigger time.

I do use these metrics for one specific alert—detecting when a timer is disabled:

- alert: SystemdTimerDisabled
  expr: |
    node_systemd_timer_last_trigger_seconds{name=~"backup.*"} == 0
  for: 1h
  labels:
    severity: info
  annotations:
    summary: "Timer {{ $labels.name }} appears to be disabled"

What Didn’t Work

My first attempt used a single monitoring script that checked all timers and wrote all metrics at once. This created a problem: if the monitoring script itself failed, all metrics disappeared simultaneously and every alert fired. False positives are worse than no alerts.

I also tried monitoring the exit code of services using systemd’s ExecStopPost to write success/failure metrics. This broke when I had services that legitimately exit with non-zero codes for certain conditions. Distinguishing between “failed” and “nothing to do” required too much service-specific logic.

I experimented with using Prometheus’s absent() function to alert when metrics disappear, but this fired constantly during the first few hours after adding new monitoring because Prometheus hadn’t scraped the metrics yet. The for clause helps but doesn’t eliminate the issue entirely.
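For reference, the absent()-based rule I experimented with looked roughly like this (the metric name matches the backup example above):

```yaml
- alert: BackupMetricMissing
  # absent() returns 1 when no series matches -- including before the
  # metric has ever been scraped, which is what caused the false positives.
  expr: absent(backup_last_success_timestamp)
  for: 30m
  labels:
    severity: warning
```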

Handling the Monitoring Script Itself

The monitoring scripts that write timestamps are a single point of failure. If they break, I lose visibility. I handle this with a meta-monitoring approach: the monitoring script itself writes a heartbeat metric.

#!/bin/bash
# At the end of check-log-freshness.sh

# Overwrite (>) rather than append (>>): duplicate metric lines in a
# textfile make node_exporter reject the whole file.
echo "monitoring_script_last_run_timestamp $(date +%s)" > /var/lib/node_exporter/textfile_collector/monitoring_heartbeat.prom

Then I alert if the monitoring script stops running:

- alert: MonitoringScriptStale
  expr: |
    (time() - monitoring_script_last_run_timestamp) > 900
  for: 15m
  labels:
    severity: critical
  annotations:
    summary: "Monitoring script heartbeat is stale (no run in 15+ minutes)"

This creates a dependency chain: Prometheus monitors the monitoring script, which monitors the actual jobs. It’s not perfect—if Prometheus itself fails, nothing alerts—but it’s better than having no checks on the monitoring infrastructure.

Real Failures This Caught

Since implementing this, I’ve caught three real failures:

Backup timer stopped due to full disk: The backup destination filled up. The rsync command failed, never wrote the success timestamp, and I got alerted within 24 hours instead of discovering it weeks later.

Certificate renewal service misconfigured: I changed the path to the renewal script but forgot to update the systemd service file. The timer fired on schedule but the service failed immediately. The log file stopped being updated and the alert fired after 7 days.

Network mount unavailable: One of my timers depends on an NFS mount. The NFS server rebooted during maintenance and didn’t come back up cleanly. The timer ran but couldn’t access the mount point. Because the service exited with an error before writing the timestamp, I got alerted.

I also had two false positives: once when I manually disabled a timer for testing and forgot to re-enable it, and once when I changed the monitoring script and introduced a syntax error that prevented it from writing metrics.

Key Takeaways

Monitoring systemd timers requires tracking absence of expected activity, not just presence of failures. Writing success timestamps from within the job itself is more reliable than trying to infer success from systemd’s state.

Keep monitoring scripts simple and independent. Each job should write its own metric rather than having one script monitor everything.

Set alert thresholds based on your actual schedule plus a reasonable grace period. A daily backup should alert if it hasn’t run in 36 hours, not 25 hours, to avoid alerts during legitimate delays.

Test your alerts by deliberately breaking things. I disabled a timer and waited to see if the alert fired. It did, but 2 hours later than I expected because I miscalculated the for clause duration.

This approach isn’t perfect. If Prometheus itself fails, I get no alerts. For truly critical jobs, I’d add external monitoring (like a dead man’s switch that expects a ping), but for my homelab setup, Prometheus-based monitoring has been sufficient and caught real failures I would have otherwise missed.
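If you do want to cover the “Prometheus itself is down” case, the usual pattern is an always-firing heartbeat alert routed to an external dead man’s switch service that complains when notifications stop arriving. A minimal sketch of the rule side only (the routing to the external receiver happens in Alertmanager and is not shown):

```yaml
- alert: DeadMansSwitch
  # vector(1) always evaluates to 1, so this alert fires permanently.
  # An external service, independent of this Prometheus, expects the
  # resulting notifications and pages you when they stop.
  expr: vector(1)
  labels:
    severity: none
```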
