Why I Built This
I run automated backups for everything in my homelab—VMs on Proxmox, Docker volumes, configuration files, databases. But backups are worthless if you can't restore them. I learned this the hard way when a corrupted backup sat unnoticed for weeks until I actually needed it.
Manual restore testing is tedious. I would randomly pick a backup, restore it somewhere, check if it worked, then clean up. This happened maybe once a month if I remembered. I needed something that ran automatically, verified the data was actually usable, and told me immediately if something broke.
I built a cron-based system that restores backups to a test environment, validates them using hash checks, and sends Slack alerts with the results. It runs nightly for critical systems and weekly for everything else.
My Real Setup
I use restic for backups, storing them on a Synology NAS and in Backblaze B2. The verification system runs on a dedicated VM in Proxmox with:
- Ubuntu 22.04
- Restic installed for restore operations
- A separate restore directory that gets wiped after each test
- Cron jobs for scheduling (example schedule after this list)
- A bash script that handles the restore and validation logic
- Slack webhook for notifications
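The schedule itself is plain cron, split into a nightly pass for critical systems and a weekly pass for everything else. Something along these lines; the script path, arguments, and log location are placeholders, not my exact setup:

# Nightly verification for critical systems at 3 AM,
# weekly pass over everything else on Sundays at 4 AM
# (script path and arguments are illustrative)
0 3 * * * /opt/verify/verify-backups.sh critical >> /var/log/backup-verify.log 2>&1
0 4 * * 0 /opt/verify/verify-backups.sh all >> /var/log/backup-verify.log 2>&1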
The backup sources include:
- PostgreSQL databases (n8n, monitoring tools)
- Docker volumes (configuration, persistent data)
- Configuration files from various services
- VM snapshots
Each backup type has a manifest file that lists expected files and their SHA256 hashes. These manifests are created during the backup process and stored separately from the backup data itself.
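Generating a manifest is little more than running sha256sum over the backup source and writing the output somewhere outside the backup itself. Roughly like this; the source and manifest paths are placeholders, and hashing relative paths keeps the manifest aligned with wherever the files land after a restore:

# Hash every file under the backup source and write the manifest
# outside the backup itself (paths are illustrative; relative paths
# keep the manifest aligned with the restore target)
cd /srv/app-data && find . -type f -exec sha256sum {} + \
    > /mnt/backup/manifests/app-data.manifest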
How the Verification Works
The core script does four things:
1. Restore from Backup
Restic restores the most recent snapshot to a temporary directory. I use --target to specify where files go and --verify to have restic check the restored files against the repository during restore.
restic -r /mnt/backup/repo restore latest \
    --target /tmp/restore-test \
    --verify
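The restore directory gets wiped between runs so stale files from a previous test can't mask a missing file. The cleanup is just:

# Fresh restore directory for every run
rm -rf /tmp/restore-test
mkdir -p /tmp/restore-test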
2. Hash Validation
After restore completes, the script reads the manifest file and compares expected hashes against actual files. This catches silent corruption that might pass restic's own checks.
# Compare each manifest entry against the restored files;
# paths are checked relative to the current directory
validation_failed=false
while read -r expected_hash file_path; do
    if [ -f "$file_path" ]; then
        actual_hash=$(sha256sum "$file_path" | cut -d' ' -f1)
        if [ "$expected_hash" != "$actual_hash" ]; then
            echo "HASH MISMATCH: $file_path"
            validation_failed=true
        fi
    else
        echo "MISSING FILE: $file_path"
        validation_failed=true
    fi
done < manifest.txt
This approach only works because I generate manifests during backup. For databases, I dump to SQL first, hash the dump file, then back up both the dump and manifest.
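On the backup side that sequence looks roughly like this; the database user, staging path, and manifest location are placeholders:

# Dump the database, hash the dump, and record the hash alongside
# the other manifests (user and paths are illustrative)
pg_dump -U backupuser n8n > /srv/backup-staging/db_dump.sql
sha256sum /srv/backup-staging/db_dump.sql > /mnt/backup/manifests/n8n-db.manifest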
3. Service-Specific Tests
For databases, hash validation alone isn't enough. I also restore the SQL dump to a test database and run basic queries to confirm the data is actually readable.
# Restore PostgreSQL dump
psql -U testuser -d testdb < /tmp/restore-test/db_dump.sql

# Run validation query
psql -U testuser -d testdb -c "SELECT COUNT(*) FROM critical_table;"
If the query fails or returns zero rows when it shouldn't, the test fails even if hashes matched.
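In the script that check looks something like this; the table name and credentials are placeholders:

# Fail the test if the query errors out or the table is empty
# (table name and credentials are illustrative)
row_count=$(psql -U testuser -d testdb -tAc "SELECT COUNT(*) FROM critical_table;") || validation_failed=true
if [ -z "$row_count" ] || [ "$row_count" -eq 0 ]; then
    echo "DB VALIDATION FAILED: critical_table has no rows"
    validation_failed=true
fi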
4. Slack Notification
Results go to a dedicated Slack channel. Success messages are brief. Failures include which files failed validation and error logs.
curl -X POST -H 'Content-type: application/json' \
--data "{\"text\":\"Backup verification FAILED for $backup_name\n$error_details\"}" \
$SLACK_WEBHOOK_URL
What Worked
Running this nightly caught three real problems in six months:
- A Docker volume backup that was silently truncating files over 2GB due to a filesystem issue
- A PostgreSQL dump that completed successfully but produced corrupted SQL because of a version mismatch in pg_dump
- A configuration file that backed up fine but couldn't be restored because of permission issues I never noticed
None of these would have been caught by just checking if the backup job succeeded. The hash validation found the truncation immediately. The database restore test caught the SQL corruption. The permission issue only showed up when actually trying to use the restored files.
Slack alerts work well because they're immediate and visible. I see failures in the morning and can investigate before they become urgent.
Using a separate VM for testing keeps the process isolated. If a restore corrupts something or uses too many resources, it doesn't affect production systems.
What Didn't Work
My first version tried to validate everything every night. This took too long and filled up the restore directory faster than I expected. I had to split critical systems (daily) from everything else (weekly).
I initially stored manifests inside the backup itself. This created a chicken-and-egg problem—I needed to restore the backup to get the manifest, but needed the manifest to validate the restore. Moving manifests to a separate location fixed this but added complexity.
Hash validation of large VM images was impractical. A 100GB VM image takes significant time to hash, and any tiny change makes the hash useless. For VMs, I switched to just checking if the restore completes and if the VM can boot in a test environment. This is less precise but actually useful.
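The boot check is deliberately simple. One way to do it on Proxmox is to start the restored test VM and poll its status; this sketch assumes the image has already been restored as a test VM and that the commands run on the Proxmox host (for example over SSH from the verification VM). The VM ID is a placeholder:

# Start the restored test VM and confirm it reports "running"
# (VM ID is illustrative; assumes the image was restored as VM 9001)
qm start 9001
sleep 60
if qm status 9001 | grep -q running; then
    echo "VM boot test passed"
else
    echo "VM boot test FAILED"
    validation_failed=true
fi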
The first Slack integration sent too much detail. Multi-page error dumps in Slack are unreadable. I now send a summary with a link to full logs stored on the verification VM.
Cron timing was tricky. Backups run at 2 AM, verification started at 3 AM. But some backups took longer than an hour, so verification ran against incomplete backups. I added a check to confirm the backup timestamp is recent enough before starting verification.
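The freshness check reads the timestamp of the latest snapshot and bails out if it's too old. One way to do it with restic and jq; the six-hour threshold is illustrative:

# Skip verification if the latest snapshot is older than 6 hours
# (threshold is illustrative; requires jq)
latest=$(restic -r /mnt/backup/repo snapshots latest --json | jq -r '.[0].time')
latest_epoch=$(date -d "$latest" +%s)
if [ $(( $(date +%s) - latest_epoch )) -gt 21600 ]; then
    echo "Latest snapshot is stale; skipping verification"
    exit 1
fi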
Key Decisions
Using restic's built-in verification plus custom hash checks provides two layers. Restic catches repository corruption. Hash checks catch problems in the actual file data.
Generating manifests during backup, not after, ensures they're always in sync with what was backed up. This adds a step to the backup process but makes validation reliable.
Testing databases by actually restoring and querying them is more work than just checking hashes, but it's the only way to know the data is usable. Hashes can match on a corrupted SQL dump.
Separate restore environments prevent validation from affecting production. This also lets me test restore procedures in conditions closer to a real disaster recovery scenario.
Keeping verification logs on the test VM for 30 days gives enough history to spot patterns without filling up storage. Slack gets summaries, the VM keeps details.
What I Learned
Automated verification is not the same as automated backups. Backups can succeed while producing unusable data. Testing restores regularly is the only way to know your backups actually work.
Hash validation catches more problems than I expected, but it requires discipline in generating and maintaining manifests. This overhead is worth it.
Not everything can be validated the same way. Files get hash checks, databases get restore tests, VMs get boot tests. The verification system has to adapt to what's being backed up.
Immediate notification matters. A failed verification is only useful if you know about it quickly. Slack works for me; email would probably get ignored.
The restore process itself needs testing, not just the backup data. Permissions, paths, dependencies—these can all break a restore even if the backup is perfect.