Why I Built Circuit Breakers for Home Automation APIs
My home automation setup talks to a bunch of different APIs—some running on local containers, others hitting cloud services for weather data or device control. The problem is that these endpoints fail. A lot.
Sometimes my n8n workflows would hang for minutes waiting on a timeout. Other times a flaky Zigbee bridge would return 500 errors for hours, and my scripts would keep hammering it uselessly. I needed a way to stop wasting time on known-bad endpoints and fall back to alternatives when things broke.
That's when I started building bash-based circuit breakers with exponential backoff. Not because it's elegant, but because my automation runs in cron jobs and systemd timers where bash is already there.
My Real Setup
I run most of my automation through:
- n8n workflows (self-hosted in Docker on Proxmox)
- Bash scripts triggered by systemd timers
- Cronicle jobs for scheduled tasks
These scripts call APIs like:
- Home Assistant (local, usually stable)
- Zigbee2MQTT bridge (local, occasionally crashes)
- OpenWeatherMap (external, rate-limited)
- Custom sensors on ESP32 devices (flaky network)
When an API goes down, I don't want my entire automation chain to freeze. I want it to back off, try a fallback if available, and log what happened so I can fix it later.
What I Built
Basic Exponential Backoff
The first thing I implemented was simple exponential backoff. Here's the actual function I use:
    retry_with_backoff() {
        local max_attempts=$1
        local cmd="${@:2}"
        local delay=2
        local attempt=1
        while [ $attempt -le $max_attempts ]; do
            echo "[$(date +%H:%M:%S)] Attempt $attempt/$max_attempts"
            if eval "$cmd"; then
                echo "[$(date +%H:%M:%S)] Success"
                return 0
            fi
            if [ $attempt -lt $max_attempts ]; then
                echo "[$(date +%H:%M:%S)] Failed. Waiting ${delay}s..."
                sleep $delay
                delay=$((delay * 2))
            fi
            attempt=$((attempt + 1))
        done
        echo "[$(date +%H:%M:%S)] All attempts failed"
        return 1
    }
I use this for calling my Zigbee2MQTT API:
    retry_with_backoff 4 "curl -sf http://192.168.1.50:8080/api/devices"
This tries 4 times with delays of 2s, 4s, 8s between attempts. If the bridge is restarting, it usually comes back within 10-15 seconds, so this pattern works.
Adding Jitter
When multiple scripts run at the same time (like when a systemd timer fires several jobs), they all retry on the same schedule. This creates thundering herd problems where everything hits the API simultaneously after each backoff.
I added random jitter to spread out the retries:
    retry_with_jitter() {
        local max_attempts=$1
        local cmd="${@:2}"
        local base_delay=2
        local attempt=1
        while [ $attempt -le $max_attempts ]; do
            echo "[$(date +%H:%M:%S)] Attempt $attempt/$max_attempts"
            if eval "$cmd"; then
                return 0
            fi
            if [ $attempt -lt $max_attempts ]; then
                local delay=$((base_delay * (2 ** (attempt - 1))))
                local jitter=$((RANDOM % 3))
                local total_delay=$((delay + jitter))
                echo "[$(date +%H:%M:%S)] Waiting ${total_delay}s..."
                sleep $total_delay
            fi
            attempt=$((attempt + 1))
        done
        return 1
    }
The jitter is small (0-2 seconds) but it's enough to prevent exact synchronization when I have 5-6 scripts all retrying the same endpoint.
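A heavier-handed option, if the fixed 0-2 second offset ever stops being enough, is "full jitter": randomize the entire delay over the backoff window instead of adding a small offset to a deterministic delay. A sketch of what that would look like with my base delay of 2 (the function name is my own):

```shell
# Pick a delay uniformly from [0, base * 2^(attempt-1)] rather than
# using a deterministic delay plus a small random offset. This spreads
# concurrent retries across the whole window.
full_jitter_delay() {
    local base=$1 attempt=$2
    local cap=$((base * (2 ** (attempt - 1))))
    echo $((RANDOM % (cap + 1)))
}
```

The tradeoff: individual scripts sometimes retry almost immediately, but a crowd of scripts no longer clusters at any particular moment.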
Circuit Breaker State File
The real circuit breaker pattern requires remembering state across script runs. If an endpoint fails consistently, I don't want every new script execution to waste time retrying it.
I use a simple state file approach:
    CIRCUIT_STATE_DIR="/var/run/automation-circuits"
    mkdir -p "$CIRCUIT_STATE_DIR"

    circuit_is_open() {
        local endpoint=$1
        local state_file="$CIRCUIT_STATE_DIR/$(echo "$endpoint" | md5sum | cut -d' ' -f1)"
        if [ -f "$state_file" ]; then
            local opened_at=$(cat "$state_file")
            local now=$(date +%s)
            local elapsed=$((now - opened_at))
            # Circuit stays open for 5 minutes
            if [ $elapsed -lt 300 ]; then
                echo "[$(date +%H:%M:%S)] Circuit open for $endpoint (opened ${elapsed}s ago)"
                return 0
            else
                rm "$state_file"
            fi
        fi
        return 1
    }

    circuit_open() {
        local endpoint=$1
        local state_file="$CIRCUIT_STATE_DIR/$(echo "$endpoint" | md5sum | cut -d' ' -f1)"
        echo "[$(date +%H:%M:%S)] Opening circuit for $endpoint"
        date +%s > "$state_file"
    }

    circuit_close() {
        local endpoint=$1
        local state_file="$CIRCUIT_STATE_DIR/$(echo "$endpoint" | md5sum | cut -d' ' -f1)"
        if [ -f "$state_file" ]; then
            echo "[$(date +%H:%M:%S)] Closing circuit for $endpoint"
            rm "$state_file"
        fi
    }
Now I can wrap API calls with circuit breaker logic:
    call_with_circuit() {
        local endpoint=$1
        local cmd="${@:2}"
        if circuit_is_open "$endpoint"; then
            return 1
        fi
        if retry_with_jitter 3 "$cmd"; then
            circuit_close "$endpoint"
            return 0
        else
            circuit_open "$endpoint"
            return 1
        fi
    }
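For completeness: the textbook circuit breaker also has a half-open state. After the open window expires, exactly one probe request is allowed through, and a failed probe reopens the circuit immediately instead of burning a full retry cycle. My version just deletes the state file and resumes normal retries. A self-contained sketch of the half-open variant (the directory, `OPEN_SECS`, and `half_open_call` are illustrative names, not from my scripts above):

```shell
CIRCUIT_DIR="${CIRCUIT_DIR:-/tmp/automation-circuits}"
OPEN_SECS="${OPEN_SECS:-300}"

half_open_call() {
    local name=$1; shift
    local state="$CIRCUIT_DIR/$name"
    mkdir -p "$CIRCUIT_DIR"
    if [ -f "$state" ]; then
        local elapsed=$(( $(date +%s) - $(cat "$state") ))
        if [ "$elapsed" -lt "$OPEN_SECS" ]; then
            return 1            # open: fail fast, send nothing
        fi
        # Half-open: window expired, allow exactly one probe
        if "$@"; then
            rm -f "$state"      # probe succeeded: close the circuit
            return 0
        fi
        date +%s > "$state"     # probe failed: restart the open window
        return 1
    fi
    "$@"                        # closed: pass the call through
}
```

For my workloads the difference is marginal, since a full retry cycle against a dead endpoint only costs a few seconds, but the probe version is gentler on a service that is still struggling to come back.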
Fallback Endpoints
For critical data like weather, I have multiple API sources. If OpenWeatherMap is down or rate-limited, I fall back to wttr.in:
    get_weather_data() {
        local primary="https://api.openweathermap.org/data/2.5/weather?q=MyCity&appid=$API_KEY"
        local fallback="https://wttr.in/MyCity?format=j1"
        if call_with_circuit "openweather" "curl -sf '$primary'"; then
            echo "[$(date +%H:%M:%S)] Got weather from primary"
            return 0
        fi
        echo "[$(date +%H:%M:%S)] Primary failed, trying fallback"
        if call_with_circuit "wttr" "curl -sf '$fallback'"; then
            echo "[$(date +%H:%M:%S)] Got weather from fallback"
            return 0
        fi
        echo "[$(date +%H:%M:%S)] All weather sources failed"
        return 1
    }
This saved me during an OpenWeatherMap outage last month. My morning automation kept working because it silently switched to the backup source.
What Worked
Exponential backoff with jitter stopped my scripts from creating retry storms. When my Zigbee bridge crashes and restarts, the randomized delays mean requests trickle in instead of slamming it all at once.
Circuit breaker state files prevent wasted attempts. If my ESP32 sensor is offline, the circuit opens and subsequent scripts skip it immediately instead of waiting through timeouts.
Fallback endpoints made my automation more resilient. Weather data, in particular, has multiple free sources that work well enough for my use case.
Logging timestamps helped me debug patterns. I could see that my Zigbee bridge was crashing every morning at 6:15 AM (turns out a specific automation was overloading it).
What Didn't Work
Too-short circuit timeouts caused problems initially. I started with 60-second circuit breaks, but some services (like my Zigbee bridge after a crash) need 2-3 minutes to fully restart. I settled on 5 minutes as a reasonable default.
MD5 hashing endpoint URLs for state files seemed clever but made debugging harder. I couldn't tell at a glance which circuit was open. I should have used sanitized endpoint names instead.
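A sketch of what I'd do instead: replace the md5sum pipeline with a readable, filesystem-safe name (the function name is my own):

```shell
# Turn an endpoint URL into a readable state file name. Anything
# outside [a-zA-Z0-9.-] becomes "_", so an open circuit shows up
# legibly in ls output instead of as an opaque hash.
sanitize_endpoint() {
    printf '%s' "$1" | tr -c 'a-zA-Z0-9.-' '_'
}
```

With this, `ls /var/run/automation-circuits` tells you at a glance which endpoints are tripped.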
No cleanup of old state files meant /var/run/automation-circuits accumulated junk. I added a weekly cleanup job to remove files older than 24 hours.
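The cleanup itself is a single find invocation; a sketch of the weekly job, wrapped in a function so the directory is easy to override (`cleanup_circuits` is my name for it, and 1440 minutes is 24 hours):

```shell
# Delete circuit state files that have not been modified in the
# last 24 hours. Takes the state dir as an argument, defaulting to
# the path used by the scripts above.
cleanup_circuits() {
    local dir="${1:-/var/run/automation-circuits}"
    find "$dir" -type f -mmin +1440 -delete
}
```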
Hardcoded retry counts were inflexible. Some endpoints need 2 tries, others need 5. I should have made max_attempts configurable per endpoint.
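If I were redoing it, a small lookup function would cover this; the endpoint names and counts below are examples of my own:

```shell
# Per-endpoint retry budget with a default, called as:
#   retry_with_jitter "$(retries_for zigbee)" "curl -sf http://..."
retries_for() {
    case "$1" in
        zigbee)   echo 5 ;;  # slow to restart, needs more attempts
        esp32-*)  echo 2 ;;  # flaky anyway, fail fast instead
        *)        echo 3 ;;  # sensible default
    esac
}
```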
Silent failures in fallback chains hid problems. If both primary and fallback failed, I had no visibility into why. I added explicit logging for each failure reason.
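The logging fix was one small helper; a sketch (the log path and function name are illustrative, not the exact ones from my setup):

```shell
FAILURE_LOG="${FAILURE_LOG:-/var/log/automation-failures.log}"

# Record which endpoint failed and why, so a dead fallback chain
# leaves a trail instead of failing silently.
log_failure() {
    local endpoint=$1 reason=$2
    echo "[$(date '+%F %T')] FAIL endpoint=$endpoint reason=$reason" >> "$FAILURE_LOG"
}
```

Each branch of the fallback chain then calls it with a distinct reason, e.g. `log_failure openweather "HTTP 429"` versus `log_failure openweather "timeout"`.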
Key Takeaways
Circuit breakers in bash aren't elegant, but they work for home automation where you're dealing with unreliable local hardware and flaky APIs.
Exponential backoff with jitter is mandatory if you have multiple scripts hitting the same endpoints. Without it, you create synchronized retry storms.
State files in /var/run work fine for tracking circuit state across script executions. Just remember to clean them up periodically.
Fallback endpoints are worth setting up for critical data sources. Weather APIs, in particular, have plenty of free alternatives.
Log everything with timestamps. When debugging why your automation failed at 3 AM, you need to know exactly which endpoint was down and for how long.
The biggest win wasn't preventing failures—it was failing faster and more gracefully. My scripts now complete in seconds instead of hanging for minutes on dead endpoints.