Tech Expert & Vibe Coder

With 15+ years of experience, I specialize in self-hosting, AI automation, and Vibe Coding – building applications using AI-powered tools like Google Antigravity, Dyad, and Cline. From homelabs to enterprise solutions.

Debugging systemd timer drift in long-running bash scripts that sync data across air-gapped networks

Why I worked on this

I run a data sync process between two physically separated networks that can’t talk to each other directly. One side collects logs and metrics, the other side processes them. The transfer happens via USB drives that get physically moved between sites.

I set up a systemd timer to run the sync script every 6 hours. After a few weeks, I noticed the timer was firing at increasingly unpredictable times. What started as a clean 00:00, 06:00, 12:00, 18:00 schedule drifted to 00:14, 06:31, 12:47, and so on. By the time I caught it, some runs were overlapping with the next scheduled trigger.

The script itself takes anywhere from 20 minutes to 2 hours, depending on how much data has piled up since the last run. I needed to understand why systemd wasn't accounting for this properly.

My real setup

The sync script is a bash script that:

  • Mounts a USB drive
  • Runs rsync to copy data from a staging directory
  • Generates checksums
  • Logs the transfer
  • Unmounts cleanly
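Stripped down, the flow looks roughly like this (the device and directory paths are placeholders, not my real setup):

```shell
#!/bin/bash
# Condensed sketch of the sync flow. Device and directory paths are
# placeholders, not the real setup.
set -u

airgap_sync() {
  local device="$1" staging="$2" mountpoint="$3"

  mount "$device" "$mountpoint" || { echo "mount failed" >&2; return 1; }

  # Copy staged data onto the drive, then checksum what actually landed.
  rsync -a "$staging"/ "$mountpoint"/data/
  ( cd "$mountpoint"/data && find . -type f -exec sha256sum {} + ) \
      > "$mountpoint"/SHA256SUMS

  # Log the transfer so it shows up in the journal.
  echo "synced $(find "$staging" -type f | wc -l) files at $(date -Is)"

  sync                              # flush writes before the drive is pulled
  umount "$mountpoint"
}

# Real invocation (needs root and an actual device), e.g.:
#   airgap_sync /dev/sdb1 /srv/staging /mnt/airgap
```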

The systemd timer was configured like this:

[Unit]
Description=Air-gap data sync timer

[Timer]
OnBootSec=10min
OnUnitActiveSec=6h
Unit=airgap-sync.service

[Install]
WantedBy=timers.target

The service file was straightforward:

[Unit]
Description=Air-gap data sync

[Service]
Type=oneshot
ExecStart=/usr/local/bin/airgap-sync.sh
StandardOutput=journal
StandardError=journal

I assumed OnUnitActiveSec=6h meant “run this every 6 hours after the last run finished.” That assumption was wrong.

What didn’t work

The drift happened because OnUnitActiveSec measures time from when the service last activated, not when it finished. If the script takes 90 minutes to complete, systemd still counts the next 6-hour window from when the service was triggered, not from when it exited. And because each window is anchored to the previous actual activation, any lateness in one run carries into the next and compounds over time.

This means:

  • Run 1 starts at 00:00 and finishes at 01:30
  • Run 2 is scheduled for 06:00, six hours after Run 1 started, not after it finished
  • If Run 1 runs long, past 06:00, the next trigger lands while Run 1 is still going

I tried adding AccuracySec=1s thinking it would tighten the timing. It didn’t help. The drift wasn’t about precision—it was about how the interval was calculated.
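The compounding is easy to see in a toy model: whatever makes one activation late (the accuracy window, load, a still-running service), that lateness gets measured into the next 6-hour window because the interval is anchored to the actual start. The per-run delays below are invented just to reproduce the shape of the times I saw:

```shell
#!/bin/bash
# Toy model of OnUnitActiveSec drift. Per-run delays are hypothetical;
# the point is that each 6h window is measured from the previous
# *actual* (late) activation, so lateness accumulates.
interval=$((6 * 3600))
elapse=0
for delay in 840 1020 960; do        # seconds late on each run (made up)
  start=$((elapse + delay))
  printf '%02d:%02d\n' $((start / 3600)) $((start % 3600 / 60))
  elapse=$((start + interval))       # next window anchored to the late start
done
```

Running the loop prints 00:14, 06:31, 12:47: the same drift pattern as the real schedule.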

I also tried using OnCalendar with fixed times like OnCalendar=00:00,06:00,12:00,18:00. This worked better for predictability, but introduced a new problem: if a run was still going when the next calendar time hit, systemd would skip that trigger entirely. I’d end up with missed syncs and no clear indication in the logs that it happened.

What worked

The fix was switching to OnUnitInactiveSec instead of OnUnitActiveSec.

[Timer]
OnBootSec=10min
OnUnitInactiveSec=6h
Unit=airgap-sync.service

This tells systemd to wait 6 hours after the service deactivates (exits) before triggering it again. If the script takes 90 minutes, the next run starts 6 hours after that 90-minute mark, not 6 hours from when it started.
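The trade-off is that start times are no longer clock-aligned: start-to-start spacing becomes runtime + 6h. A quick model with a steady 90-minute run (the runtime is an assumption for illustration):

```shell
#!/bin/bash
# Start-to-start spacing under OnUnitInactiveSec=6h: each gap is
# runtime + interval, so runs never overlap but drift off the clock.
interval=$((6 * 3600))
runtime=$((90 * 60))                 # assume a steady 90-minute run
start=0
for run in 1 2 3; do
  printf 'run %d starts at +%02d:%02d\n' "$run" \
      $((start / 3600)) $((start % 3600 / 60))
  start=$((start + runtime + interval))
done
```

That puts starts at +00:00, +07:30, +15:00. For this workflow that was fine: consistent spacing mattered more than clock alignment.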

I also added a check in the service file to prevent overlapping runs:

[Service]
Type=oneshot
# flock wraps the script so the lock is held for the entire run
ExecStart=/usr/bin/flock -n /var/lock/airgap-sync.lock /usr/local/bin/airgap-sync.sh
StandardOutput=journal
StandardError=journal

flock -n tries to acquire the lock file without blocking. If another instance already holds the lock, flock fails immediately and the sync doesn't run. This gave me a clean way to handle edge cases where something else triggered the service manually.
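The non-blocking behavior is easy to verify on its own, with a throwaway lock file instead of the real one:

```shell
#!/bin/bash
# Hold a lock in the background, then show that a second -n attempt
# fails immediately instead of waiting for the lock to free up.
lock=$(mktemp)

flock -n "$lock" -c 'sleep 1' &     # first holder keeps the lock for 1s
sleep 0.2                           # let it actually grab the lock

if flock -n "$lock" -c true; then
  result="lock acquired"
else
  result="lock busy, skipping"
fi
echo "$result"                      # prints "lock busy, skipping"
wait
```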

Inside the bash script, I added explicit logging of start and end times:

#!/bin/bash

START_TIME=$(date +%s)
echo "Sync started at $(date)"

# ... actual sync logic ...

END_TIME=$(date +%s)
DURATION=$((END_TIME - START_TIME))
echo "Sync completed in ${DURATION} seconds"

This let me track drift patterns in the journal and confirm the timer was behaving correctly after the change.
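Those end-of-run lines are also easy to mine later. A small sketch of pulling the durations back out with awk (in practice the input would come from `journalctl -u airgap-sync.service -o cat`; the sample lines here are made up):

```shell
#!/bin/bash
# Extract the duration field from "Sync completed in N seconds" lines.
durations=$(awk '/^Sync completed in/ { print $4 }' <<'EOF'
Sync started at Mon Jan  1 00:00:00 UTC 2024
Sync completed in 5400 seconds
Sync started at Mon Jan  1 06:00:00 UTC 2024
Sync completed in 7200 seconds
EOF
)
echo "$durations"
```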

What I learned about systemd timers

OnUnitActiveSec is useful for tasks that finish quickly and predictably. For long-running or variable-length jobs, it creates drift because the clock starts ticking before the work is done.

OnUnitInactiveSec is the right choice when you want consistent spacing between completed runs, not between start times.

OnCalendar is clean for fixed schedules, but it doesn’t handle long-running tasks gracefully. If a run is still active when the next calendar event fires, that trigger is lost. There’s no queue or retry.

Systemd doesn’t prevent overlapping service runs by default. You have to enforce that yourself, either with flock or by checking for existing processes in your script.

The journal is your friend. I started running systemctl list-timers airgap-sync.timer regularly to compare the last trigger and next elapse times. When those values stopped making sense relative to each other, I knew something was wrong.

Key takeaways

If your systemd timer drifts over time, check whether you’re using the right time reference. OnUnitActiveSec measures from service start, OnUnitInactiveSec measures from service end.

Long-running scripts need protection against overlapping runs. A simple lock file with flock is enough for most cases.

Logging execution time inside your script makes debugging timer behavior much easier. You can’t fix what you can’t measure.

Air-gapped workflows are fragile by nature. The fewer assumptions you make about timing and state, the fewer surprises you’ll hit when something takes longer than expected.
