Why I Built This
I take snapshots of my TrueNAS datasets every day. They’re automatic, they’re versioned, and they sit there quietly in the background. For a long time, I assumed that meant my backups were solid.
Then I had a moment of doubt: what if a snapshot is corrupted? What if the data I’m snapshotting is already broken? What if I can’t actually restore from these when I need to?
I didn’t want to wait for an emergency to find out. I needed a way to regularly verify that my snapshots could actually be restored and that the data inside them was intact. Not just once, but automatically, on a schedule I could trust.
My Setup
I run TrueNAS Scale on a dedicated box with a ZFS pool that holds my critical datasets. I already had periodic snapshot tasks configured through the TrueNAS UI, creating hourly, daily, and weekly snapshots with retention policies.
On the same network, I have a Proxmox host where I run Docker containers for various automation tasks. I use this environment to spin up lightweight test containers that can mount datasets, run checks, and report results.
The goal was simple: pick a recent snapshot, clone it to a temporary dataset, mount it inside a Docker container, run integrity checks on the files, and log the results. If something fails, I want to know immediately.
The Tools I Used
- TrueNAS Scale with ZFS snapshots already configured
- Docker on a separate Linux host (could also run directly on TrueNAS if you prefer)
- A simple Python script to orchestrate the verification process
- SSH access from the Docker host to TrueNAS for running ZFS commands
- Cronicle for scheduling the verification runs (you could use cron or any other scheduler)
How I Built the Pipeline
Step 1: Listing and Selecting Snapshots
The first thing I needed was a way to programmatically list snapshots and pick one to verify. I wrote a small script that SSHs into my TrueNAS box and runs:
zfs list -t snapshot -o name,creation -s creation
This gives me a list of all snapshots sorted by creation time. I filter for the dataset I care about and grab the most recent daily snapshot, skipping the very latest one. Snapshots themselves are immutable once taken, but the newest one may still be mid-replication or about to be rotated out by a retention policy, so I verify the one just behind it.
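The selection logic looks roughly like this. This is a sketch rather than my exact script: the host name, dataset, and snapshot naming convention are placeholders, and it assumes the `zfs list` invocation shown above (with `-H` added for machine-readable output).

```python
import subprocess

def list_snapshots(host, dataset):
    """Run `zfs list` on the TrueNAS box over SSH and return snapshot
    names for one dataset, oldest first (sorted by -s creation)."""
    out = subprocess.run(
        ["ssh", host, "zfs", "list", "-t", "snapshot",
         "-H", "-o", "name", "-s", "creation"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line for line in out.splitlines()
            if line.startswith(dataset + "@")]

def pick_snapshot(snapshots, tag="daily"):
    """Pick the second-newest snapshot whose name contains the tag,
    deliberately skipping the very latest one."""
    matching = [s for s in snapshots if tag in s.split("@", 1)[1]]
    return matching[-2] if len(matching) >= 2 else None
```

The `tag` filter relies on the snapshot task naming its snapshots with a recognizable prefix like `daily-`, which is how the TrueNAS periodic snapshot tasks name mine.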
Step 2: Cloning the Snapshot
Instead of mounting the snapshot directly (which is read-only and can be tricky with permissions), I clone it to a temporary dataset:
zfs clone pool/dataset@snapshot-name pool/verify-temp
This creates a writable clone that I can mount and test without affecting the original snapshot. The clone is thin-provisioned, so it doesn’t immediately consume extra space.
Step 3: Mounting in a Docker Container
I created a simple Docker container based on Alpine Linux with a few utilities installed: rsync, sha256sum, and find. The container mounts the cloned dataset via NFS from TrueNAS.
I already had NFS shares configured on TrueNAS for other purposes, so I just added the temporary dataset to the allowed exports during the verification run. The Docker container mounts it like this:
docker run --rm \
  -v /mnt/nfs-verify:/data:ro \
  alpine-verify:latest \
  /scripts/verify.sh
The verify.sh script inside the container does the actual integrity checking.
Step 4: Running Integrity Checks
The verification script does a few things:
- Counts the total number of files in the dataset
- Checks for any files with unexpected permissions or ownership (this caught a misconfiguration once)
- Runs checksums on a sample of files to ensure they’re readable and not corrupted
- Attempts to open and read a few known critical files (like database dumps or config files)
I don’t checksum every single file because that would take too long on large datasets. Instead, I sample a percentage of files randomly and verify those. If the sample passes, I have reasonable confidence the rest is intact.
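The sampling check is the heart of the script. Here's a simplified version of that logic; the sample fraction and chunk size are illustrative, and my real script also records which files were sampled so runs are comparable:

```python
import hashlib
import random
from pathlib import Path

def sample_checksum(root, fraction=0.05, seed=None):
    """Walk the mounted clone, pick a random sample of files, and read
    each one fully through SHA-256. A file that can't be read to the end
    counts as a failure; on ZFS, an unreadable file usually means a
    block-level checksum error surfaced as an I/O error."""
    rng = random.Random(seed)
    files = [p for p in Path(root).rglob("*") if p.is_file()]
    sample = rng.sample(files, max(1, int(len(files) * fraction))) if files else []
    failures = []
    for path in sample:
        try:
            digest = hashlib.sha256()
            with open(path, "rb") as f:
                # read in 1 MiB chunks so large files don't blow up memory
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    digest.update(chunk)
        except OSError:
            failures.append(path)
    return len(sample), failures
```

Passing a `seed` makes a run reproducible, which is handy when you want to re-check the exact files a failed run complained about.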
For datasets that contain database backups, I also added a step to restore the backup into a temporary PostgreSQL or MySQL container and run a basic query. This ensures the backup file isn’t just present, but actually usable.
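For the PostgreSQL case, the restore test boils down to a short sequence of docker commands. The sketch below just builds that sequence (container name, image tag, and paths are illustrative, and it assumes a plain-SQL dump); a real run also needs a readiness wait, for example looping on `pg_isready`, between starting the container and loading the dump:

```python
import os

def restore_test_commands(dump_path, container="pg-verify"):
    """Build the docker command sequence for a throwaway PostgreSQL
    restore test: start a server with the dump mounted, load it, run a
    sanity query, tear down. Returns the commands as argv lists."""
    dump_dir = os.path.dirname(os.path.abspath(dump_path))
    dump_file = os.path.basename(dump_path)
    return [
        # throwaway server with the dump directory mounted read-only
        ["docker", "run", "-d", "--rm", "--name", container,
         "-e", "POSTGRES_PASSWORD=verify",
         "-v", f"{dump_dir}:/dump:ro", "postgres:16"],
        # load the dump; a failure here means the backup is unusable
        ["docker", "exec", container, "psql", "-U", "postgres",
         "-f", f"/dump/{dump_file}"],
        # basic sanity query against the restored data
        ["docker", "exec", container, "psql", "-U", "postgres",
         "-c", "SELECT count(*) FROM pg_tables;"],
        # stop the server; --rm on the run removes the container
        ["docker", "stop", container],
    ]
```

Any non-zero exit along the way fails the whole verification run, which is exactly what caught the truncated backup described below.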
Step 5: Cleanup and Logging
After the checks complete, the script logs the results to a file and sends a summary to a monitoring endpoint I have set up (a simple webhook that posts to a Slack channel). If any check fails, the alert is immediate.
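The webhook side is nothing fancy. A minimal version, assuming a Slack-style incoming webhook that accepts a JSON body with a `text` field, looks like this:

```python
import json
import urllib.request

def build_summary(snapshot, checked, failures):
    """Format the verification result as a Slack-style webhook payload."""
    status = "FAILED" if failures else "OK"
    lines = [f"Snapshot verification {status}: {snapshot}",
             f"Files checked: {checked}"]
    if failures:
        lines.append("Failures: " + ", ".join(str(f) for f in failures))
    return {"text": "\n".join(lines)}

def post_summary(webhook_url, payload):
    """POST the payload to the webhook endpoint."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

Keeping the payload builder separate from the HTTP call makes the formatting easy to test without actually posting to the channel.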
Then I clean up:
zfs destroy pool/verify-temp
The clone is deleted, and the space is reclaimed. The whole process takes between 5 and 20 minutes depending on the dataset size.
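One detail worth calling out: the destroy runs in a `finally` block, so a failed or crashed check never leaves a stray clone behind. A sketch of that clone-check-destroy lifecycle (host and dataset names are placeholders; the `dry_run`/`log` parameters exist so the orchestration itself can be tested without a real pool):

```python
import subprocess

def run_zfs(host, *args, dry_run=False, log=None):
    """Run a zfs command on the TrueNAS host over SSH. With dry_run,
    only record the command instead of executing it."""
    cmd = ["ssh", host, "zfs", *args]
    if log is not None:
        log.append(cmd)
    if not dry_run:
        subprocess.run(cmd, check=True)

def verify_with_cleanup(host, snapshot, clone="pool/verify-temp",
                        check=lambda: True, dry_run=False, log=None):
    """Clone the snapshot, run the checks, and always destroy the
    clone afterwards. The finally block keeps temporary clones from
    piling up when a check raises."""
    run_zfs(host, "clone", snapshot, clone, dry_run=dry_run, log=log)
    try:
        return check()
    finally:
        run_zfs(host, "destroy", clone, dry_run=dry_run, log=log)
```

Before I structured it this way, an exception mid-check would occasionally leave an orphaned `verify-temp` clone that blocked the next run.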
What Worked
This setup has been running for about six months now, and it’s caught two real issues:
The first was a snapshot of a dataset where the underlying data had been corrupted before the snapshot was taken. The files were there, but they were unreadable. I only discovered this because the verification script tried to checksum them and failed. I was able to restore from an older snapshot before the corruption happened.
The second issue was a database backup that was incomplete. The backup script had silently failed partway through, but the file was still created with a normal-looking filename. The restore test inside the Docker container failed immediately, and I fixed the backup script.
Both of these would have been disasters if I’d only found out during an actual restore scenario.
Why Docker Containers Work Well for This
Using Docker containers for the verification process keeps everything isolated. I can spin up a clean environment, run the checks, and tear it down without leaving any residue on the host system. It also makes it easy to test different types of data—I have separate container images for verifying database backups, media files, and configuration archives.
The containers are lightweight and start quickly, which is important when you’re running these checks on a schedule.
What Didn’t Work
My first attempt tried to mount the ZFS snapshot directly inside the Docker container using a bind mount. This was a mess. Permissions were wrong, the mount was read-only in ways that confused some of my verification tools, and I kept running into issues with stale NFS handles.
Cloning the snapshot to a temporary dataset solved all of these problems. It’s a bit more overhead, but it’s worth it for the reliability.
I also initially tried to verify every file in the dataset, which was way too slow. On a dataset with hundreds of thousands of small files, the verification run would take hours. Switching to a sampling approach (checking 5-10% of files randomly) gave me good enough confidence in a fraction of the time.
Another mistake: I didn’t set up proper alerting at first. The script would run, log results to a file, and I’d forget to check it. Adding the webhook to Slack made a huge difference—I actually see the results now, and I know immediately if something fails.
Key Takeaways
Snapshots are not backups until you’ve verified you can restore from them. Automating that verification is the only way to be sure it actually happens.
Cloning snapshots to temporary datasets is cleaner and more reliable than trying to mount them directly, especially when you need to run tests that expect normal filesystem behavior.
You don’t need to verify every byte of every file. A well-chosen sample plus targeted checks on critical files gives you enough confidence without burning hours of compute time.
Docker containers are a good fit for this kind of work. They’re disposable, isolated, and easy to customize for different types of data.
If you’re not testing your restores, you’re not really backed up. This pipeline gave me the confidence I was missing.