Why I Built a Bash Dashboard for ZFS, SMART, and UPS Monitoring

I run a Proxmox server with ZFS pools that hold everything from VM images to backups. Over time, I realized I was checking pool health manually, running smartctl commands to see disk errors, and occasionally forgetting to verify the UPS battery status until something went wrong.

What I needed was a single script that could run on a schedule, collect all these metrics, and feed them into Prometheus using the textfile collector. That way, I could build Grafana dashboards and set up alerts without writing custom exporters or installing bloated monitoring agents.

This article walks through how I built that script, what worked, what didn't, and what I learned about exposing system health data in a format Prometheus can actually use.

My Real Setup

Here's what I'm working with:

  • Proxmox VE 8.x running on bare metal
  • Two ZFS pools: one mirrored for VMs, one RAIDZ2 for backups
  • Four spinning disks and two SSDs, all monitored via SMART
  • A CyberPower UPS connected via USB, managed by apcupsd
  • Prometheus running in a Docker container on the same host
  • Node Exporter's textfile collector enabled, pointing to /var/lib/node_exporter/textfile_collector

I wanted one script that could run every 5 minutes via cron, write metrics to a .prom file, and let Prometheus scrape them alongside the standard node metrics.
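
For reference, enabling the textfile collector is just a matter of pointing Node Exporter at that directory. However you run Node Exporter (systemd unit, Docker, etc.), the relevant part is this flag:

node_exporter --collector.textfile.directory=/var/lib/node_exporter/textfile_collector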

What the Script Does

The script checks three things:

1. ZFS Pool Health

I use zpool list to grab pool capacity and zpool status to extract the pool state and error counts. The script parses the output and converts it into Prometheus metrics like:

zfs_pool_health{pool="tank"} 1
zfs_pool_capacity_percent{pool="tank"} 67
zfs_pool_read_errors{pool="tank"} 0
zfs_pool_write_errors{pool="tank"} 0
zfs_pool_checksum_errors{pool="tank"} 0

I assign 1 for healthy pools and 0 for anything degraded or faulted. This makes it easy to alert when a pool drops to 0.
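
To give a concrete idea of the parsing, here's a trimmed-down sketch of the ZFS section (not my exact script; it relies on the top-level pool row in zpool status carrying the READ/WRITE/CKSUM columns, and it prints to stdout where the real script appends to the temp file described later):

for pool in $(zpool list -H -o name); do
  health=$(zpool list -H -o health "$pool")
  capacity=$(zpool list -H -o capacity "$pool" | tr -d '%')

  # ONLINE -> 1, DEGRADED/FAULTED/anything else -> 0
  if [ "$health" = "ONLINE" ]; then h=1; else h=0; fi

  # The config row whose first column is the pool name holds the
  # pool-level READ WRITE CKSUM error counters
  read -r rerr werr cerr <<< "$(zpool status "$pool" \
    | awk -v p="$pool" '$1 == p {print $3, $4, $5; exit}')"

  echo "zfs_pool_health{pool=\"$pool\"} $h"
  echo "zfs_pool_capacity_percent{pool=\"$pool\"} $capacity"
  echo "zfs_pool_read_errors{pool=\"$pool\"} ${rerr:-0}"
  echo "zfs_pool_write_errors{pool=\"$pool\"} ${werr:-0}"
  echo "zfs_pool_checksum_errors{pool=\"$pool\"} ${cerr:-0}"
done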

2. SMART Disk Status

For each disk, I run smartctl -H to check overall health and smartctl -A to pull specific attributes like reallocated sectors, temperature, and power-on hours.

The metrics look like this:

smart_disk_health{device="/dev/sda"} 1
smart_reallocated_sectors{device="/dev/sda"} 0
smart_temperature_celsius{device="/dev/sda"} 38
smart_power_on_hours{device="/dev/sda"} 12456

I only track attributes that matter for predicting failures. I don't scrape every SMART field—just the ones that have actually warned me about problems in the past.
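
A sketch of the SMART section, under the assumption that the disks expose a standard ATA attribute table (ID in column 1, raw value in column 10 of smartctl -A output):

for disk in /dev/sd?; do
  # Overall verdict: "PASSED" -> 1, anything else -> 0
  if smartctl -H "$disk" | grep -q "PASSED"; then ok=1; else ok=0; fi
  echo "smart_disk_health{device=\"$disk\"} $ok"

  attrs=$(smartctl -A "$disk")

  # Attribute 5 = Reallocated_Sector_Ct, 9 = Power_On_Hours, 194 = Temperature_Celsius
  realloc=$(echo "$attrs" | awk '$1 == 5   {print $10}')
  hours=$(echo "$attrs"   | awk '$1 == 9   {print $10}')
  temp=$(echo "$attrs"    | awk '$1 == 194 {print $10}')

  echo "smart_reallocated_sectors{device=\"$disk\"} ${realloc:-0}"
  echo "smart_power_on_hours{device=\"$disk\"} ${hours:-0}"
  echo "smart_temperature_celsius{device=\"$disk\"} ${temp:-0}"
done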

3. UPS Battery Status

I use apcaccess to pull UPS data and extract battery charge percentage, load percentage, and runtime remaining:

ups_battery_charge_percent 95
ups_load_percent 23
ups_runtime_minutes 42
ups_status{status="ONLINE"} 1

The ups_status metric uses a label to indicate whether the UPS is online, on battery, or in some other state. I set it to 1 when online and 0 otherwise.
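
The UPS section is the simplest of the three. A rough sketch, assuming the usual "KEY : value unit" layout of apcaccess output:

apc=$(apcaccess 2>/dev/null)

charge=$(echo "$apc"  | awk '/^BCHARGE/  {print $3}')
load=$(echo "$apc"    | awk '/^LOADPCT/  {print $3}')
runtime=$(echo "$apc" | awk '/^TIMELEFT/ {print $3}')
status=$(echo "$apc"  | awk '/^STATUS/   {print $3}')

echo "ups_battery_charge_percent ${charge:-0}"
echo "ups_load_percent ${load:-0}"
echo "ups_runtime_minutes ${runtime:-0}"

# 1 only when the UPS reports ONLINE
if [ "$status" = "ONLINE" ]; then s=1; else s=0; fi
echo "ups_status{status=\"${status:-UNKNOWN}\"} $s"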

How I Structured the Script

The script is a single bash file that writes to a temporary file and then atomically moves it to the final .prom location. This prevents Prometheus from scraping incomplete data.

Here's the basic structure:

#!/bin/bash

OUTPUT_FILE="/var/lib/node_exporter/textfile_collector/system_health.prom"
TEMP_FILE="${OUTPUT_FILE}.tmp"

# Clear temp file
> "$TEMP_FILE"

# Collect ZFS metrics
for pool in $(zpool list -H -o name); do
  # Parse zpool status and write metrics
done

# Collect SMART metrics
for disk in /dev/sd?; do
  # Run smartctl and parse output
done

# Collect UPS metrics
# Parse apcaccess output

# Move temp file to final location
mv "$TEMP_FILE" "$OUTPUT_FILE"

I use awk and grep to parse command output. No Python, no jq—just standard Unix tools that are already on the system.

What Worked

Textfile Collector is Simple

The textfile collector is the easiest way to get custom metrics into Prometheus. You just write a file in the right format, and Prometheus picks it up on the next scrape. No API calls, no network listeners, no daemon management.
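
The format is plain text, one metric per line, optionally preceded by HELP and TYPE comments. The ZFS health metric from earlier, written out in full, looks like this:

# HELP zfs_pool_health 1 if the pool state is ONLINE, 0 otherwise
# TYPE zfs_pool_health gauge
zfs_pool_health{pool="tank"} 1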

Atomic File Writes Prevent Partial Scrapes

Writing to a temp file and using mv ensures Prometheus never sees half-written data. This is critical when the script takes a few seconds to run.

Parsing ZFS Output is Straightforward

The zpool status output is consistent and easy to parse with awk. I extract the pool name, state, capacity percentage, and error counts in a few lines.

SMART Data is Rich but Noisy

SMART attributes vary by manufacturer, so I only track the ones that are universally useful: reallocated sectors, current pending sectors, temperature, and power-on hours. Trying to track everything just clutters the metrics.

What Didn't Work

SMART Parsing is Fragile

Different disk models report SMART data in slightly different formats. I had to add conditional logic to handle cases where certain attributes are missing or reported differently.

For example, some SSDs don't report temperature in the same field as spinning disks. I ended up checking multiple attribute IDs and falling back to a default value if none are present.
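
The fallback is nothing clever: check attribute 194 (Temperature_Celsius) first, then 190 (Airflow_Temperature_Cel), which is the field some SSDs use instead. Roughly, reusing the attrs variable from the SMART sketch above:

temp=$(echo "$attrs" | awk '$1 == 194 {print $10}')
if [ -z "$temp" ]; then
  temp=$(echo "$attrs" | awk '$1 == 190 {print $10}')
fi
echo "smart_temperature_celsius{device=\"$disk\"} ${temp:-0}"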

ZFS Scrub Timing is Harder Than It Looks

I initially tried to track time since the last scrub by parsing zpool status. The output includes a scrub timestamp, but it's in a human-readable format that requires date parsing.

I converted it to a Unix timestamp and calculated the difference, but the logic was messy. I eventually decided to just expose the scrub completion time as a metric and let Prometheus handle the calculation.
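
What I settled on is roughly the following: grab the date off the scan line, convert it with date -d, and expose it as a Unix timestamp. The metric name here is just what I'd call it, and the wording of the scan line varies between ZFS versions, so treat this as a sketch:

# scan line looks like:
# "scan: scrub repaired 0B in 00:15:23 with 0 errors on Sun Nov 12 00:39:24 2023"
scrub_date=$(zpool status "$pool" | sed -n 's/.* with [0-9]* errors on //p')
if [ -n "$scrub_date" ]; then
  echo "zfs_pool_last_scrub_timestamp_seconds{pool=\"$pool\"} $(date -d "$scrub_date" +%s)"
fi

On the Prometheus side, time() - zfs_pool_last_scrub_timestamp_seconds then gives the age of the last scrub, which is much easier to alert on than doing the arithmetic in bash.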

UPS Data Requires apcupsd

This script assumes you're using apcupsd. If you're using a different UPS management tool, you'll need to adjust the parsing logic. I don't have a universal solution for this.

How I Run the Script

I added a cron job to run the script every 5 minutes:

*/5 * * * * /usr/local/bin/system_health_monitor.sh

I set the script to run as root because it needs access to zpool, smartctl, and apcaccess. I could probably tighten permissions with sudo rules, but for a single-user system, this is fine.
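
If I ever do tighten it up, the sudo route would be something like the following hypothetical sudoers entries for a dedicated metrics user (binary paths vary by distribution, so check them with which first):

# /etc/sudoers.d/metrics  (edit with visudo; paths are examples)
metrics ALL=(root) NOPASSWD: /usr/sbin/zpool, /usr/sbin/smartctl, /usr/sbin/apcaccess

The script would then prefix those commands with sudo and run from the metrics user's crontab instead of root's.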

Metrics I Actually Use

Once the metrics were in Prometheus, I built a Grafana dashboard with these panels:

  • ZFS pool health status (green/red indicator)
  • Pool capacity over time (line graph)
  • Disk temperature trends (multi-line graph)
  • Reallocated sector counts (table)
  • UPS battery charge and runtime (gauge)

I set up alerts for the following (a sample Prometheus rule is shown after the list):

  • Any pool not in a healthy state
  • Pool capacity above 80%
  • Any disk with non-zero reallocated sectors
  • UPS battery below 50% while on battery
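
In Prometheus terms, the first two of these are one-line expressions against the metrics above. A sample rule file might look like this (alert names and thresholds are illustrative):

groups:
  - name: system_health
    rules:
      - alert: ZfsPoolUnhealthy
        expr: zfs_pool_health == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "ZFS pool {{ $labels.pool }} is not healthy"
      - alert: ZfsPoolAlmostFull
        expr: zfs_pool_capacity_percent > 80
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "ZFS pool {{ $labels.pool }} is above 80% capacity"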

These alerts have already caught two issues: a disk that started showing reallocated sectors and a UPS battery that was slowly degrading.

Key Takeaways

The textfile collector is the right tool for this. It's simple, reliable, and doesn't require running a separate exporter process.

Parsing system output is tedious but doable. I spent more time handling edge cases than writing the core logic. Test your script on different disk models and pool configurations.

Only track metrics you'll actually use. I initially tried to expose every SMART attribute and ended up with hundreds of useless metrics. Focus on what predicts failures.

Atomic file writes matter. Don't skip the temp file step. Prometheus will scrape partial data if you write directly to the output file.

Run the script frequently. Five minutes is a good interval for health checks. Anything longer and you risk missing transient errors.

Limitations and Trade-offs

This script doesn't handle:

  • Multiple UPS units (I only have one)
  • Non-ZFS filesystems (it's ZFS-specific)
  • Disks without SMART support (rare, but possible)
  • Advanced ZFS features like scrub progress or ARC stats (I don't need them yet)

It also assumes Node Exporter runs on the host being monitored, reading the textfile collector directory locally; Prometheus itself only needs to reach Node Exporter's metrics endpoint, so it can run anywhere with network access to it. If your topology differs, you'll need to adjust the setup.

What I'd Change

If I were starting over, I'd probably use a proper configuration file instead of hardcoding paths and thresholds in the script. Right now, if I want to change the output directory or add a new disk, I have to edit the script directly.

I'd also add a dry-run mode that prints metrics to stdout instead of writing to a file. This would make testing easier.
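
A dry-run mode is only a handful of lines; a hypothetical version (the flag name is illustrative, not from the current script):

DRY_RUN=0
if [ "${1:-}" = "--dry-run" ]; then
  DRY_RUN=1
fi

# ... collect metrics into "$TEMP_FILE" as before ...

if [ "$DRY_RUN" -eq 1 ]; then
  # Print the metrics for inspection and discard the temp file
  cat "$TEMP_FILE"
  rm -f "$TEMP_FILE"
else
  mv "$TEMP_FILE" "$OUTPUT_FILE"
fi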

Finally, I'd document the expected SMART attribute IDs more clearly. Right now, the script just checks for common IDs, but it's not obvious which disks support which attributes.

Final Thoughts

This script has been running on my Proxmox server for several months now. It's simple, it works, and it's given me visibility into system health without adding complexity.

If you're running ZFS and already have Prometheus set up, this approach is worth trying. You don't need custom exporters or third-party tools—just a bash script and the textfile collector.

The full script is too long to include here, but the structure I've outlined should be enough to build your own version. Focus on the metrics that matter for your setup, test the parsing logic thoroughly, and keep it simple.