Implementing Bash Script Circuit Breakers for Unreliable Home Automation APIs: Exponential Backoff and Fallback Endpoints

Why I Built Circuit Breakers for Home Automation APIs

My home automation setup talks to a bunch of different APIs—some running on local containers, others hitting cloud services for weather data or device control. The problem is that these endpoints fail. A lot.

Sometimes my n8n workflows would hang for minutes waiting on a timeout. Other times a flaky Zigbee bridge would return 500 errors for hours, and my scripts would keep hammering it uselessly. I needed a way to stop wasting time on known-bad endpoints and fall back to alternatives when things broke.

That's when I started building bash-based circuit breakers with exponential backoff. Not because it's elegant, but because my automation runs in cron jobs and systemd timers where bash is already there.

My Real Setup

I run most of my automation through:

  • n8n workflows (self-hosted in Docker on Proxmox)
  • Bash scripts triggered by systemd timers
  • Cronicle jobs for scheduled tasks

These scripts call APIs like:

  • Home Assistant (local, usually stable)
  • Zigbee2MQTT bridge (local, occasionally crashes)
  • OpenWeatherMap (external, rate-limited)
  • Custom sensors on ESP32 devices (flaky network)

When an API goes down, I don't want my entire automation chain to freeze. I want it to back off, try a fallback if available, and log what happened so I can fix it later.

What I Built

Basic Exponential Backoff

The first thing I implemented was simple exponential backoff. Here's the actual function I use:

retry_with_backoff() {
  local max_attempts=$1
  local cmd="${*:2}"
  local delay=2
  local attempt=1

  while [ $attempt -le $max_attempts ]; do
    echo "[$(date +%H:%M:%S)] Attempt $attempt/$max_attempts"
    
    if eval "$cmd"; then
      echo "[$(date +%H:%M:%S)] Success"
      return 0
    fi
    
    if [ $attempt -lt $max_attempts ]; then
      echo "[$(date +%H:%M:%S)] Failed. Waiting ${delay}s..."
      sleep $delay
      delay=$((delay * 2))
    fi
    
    attempt=$((attempt + 1))
  done
  
  echo "[$(date +%H:%M:%S)] All attempts failed"
  return 1
}

I use this for calling my Zigbee2MQTT API:

retry_with_backoff 4 "curl -sf http://192.168.1.50:8080/api/devices"

This tries 4 times, waiting 2s, 4s, and 8s between attempts (about 14 seconds of waiting in the worst case). If the bridge is restarting, it usually comes back within 10-15 seconds, so this pattern works.

Adding Jitter

When multiple scripts run at the same time (like when a systemd timer fires several jobs), they all retry on the same schedule. This creates thundering herd problems where everything hits the API simultaneously after each backoff.

I added random jitter to spread out the retries:

retry_with_jitter() {
  local max_attempts=$1
  local cmd="${*:2}"
  local base_delay=2
  local attempt=1

  while [ $attempt -le $max_attempts ]; do
    echo "[$(date +%H:%M:%S)] Attempt $attempt/$max_attempts"
    
    if eval "$cmd"; then
      return 0
    fi
    
    if [ $attempt -lt $max_attempts ]; then
      local delay=$((base_delay * (2 ** (attempt - 1))))
      local jitter=$((RANDOM % 3))
      local total_delay=$((delay + jitter))
      
      echo "[$(date +%H:%M:%S)] Waiting ${total_delay}s..."
      sleep $total_delay
    fi
    
    attempt=$((attempt + 1))
  done
  
  return 1
}

The jitter is small (0-2 seconds) but it's enough to prevent exact synchronization when I have 5-6 scripts all retrying the same endpoint.

Circuit Breaker State File

The real circuit breaker pattern requires remembering state across script runs. If an endpoint fails consistently, I don't want every new script execution to waste time retrying it.

I use a simple state file approach:

CIRCUIT_STATE_DIR="/var/run/automation-circuits"
mkdir -p "$CIRCUIT_STATE_DIR"

circuit_is_open() {
  local endpoint=$1
  local state_file="$CIRCUIT_STATE_DIR/$(echo "$endpoint" | md5sum | cut -d' ' -f1)"
  
  if [ -f "$state_file" ]; then
    local opened_at=$(cat "$state_file")
    local now=$(date +%s)
    local elapsed=$((now - opened_at))
    
    # Circuit stays open for 5 minutes
    if [ $elapsed -lt 300 ]; then
      echo "[$(date +%H:%M:%S)] Circuit open for $endpoint (${elapsed}s ago)"
      return 0
    else
      rm "$state_file"
    fi
  fi
  
  return 1
}

circuit_open() {
  local endpoint=$1
  local state_file="$CIRCUIT_STATE_DIR/$(echo "$endpoint" | md5sum | cut -d' ' -f1)"
  
  echo "[$(date +%H:%M:%S)] Opening circuit for $endpoint"
  date +%s > "$state_file"
}

circuit_close() {
  local endpoint=$1
  local state_file="$CIRCUIT_STATE_DIR/$(echo "$endpoint" | md5sum | cut -d' ' -f1)"
  
  if [ -f "$state_file" ]; then
    echo "[$(date +%H:%M:%S)] Closing circuit for $endpoint"
    rm "$state_file"
  fi
}
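
While a circuit is open, the state file just holds the epoch timestamp it was opened at, so a quick loop shows what's currently tripped (this assumes GNU date for the -d flag):

for f in "$CIRCUIT_STATE_DIR"/*; do
  [ -f "$f" ] || continue
  # Filenames are MD5 hashes of the endpoint name, values are epoch seconds
  echo "$(basename "$f") opened at $(date -d "@$(cat "$f")" +%H:%M:%S)"
done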

Now I can wrap API calls with circuit breaker logic:

call_with_circuit() {
  local endpoint=$1
  local cmd="${*:2}"
  
  if circuit_is_open "$endpoint"; then
    return 1
  fi
  
  if retry_with_jitter 3 "$cmd"; then
    circuit_close "$endpoint"
    return 0
  else
    circuit_open "$endpoint"
    return 1
  fi
}
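
Putting it together, the Zigbee call from earlier becomes (the "zigbee2mqtt" label is just the key I use for that endpoint's circuit state):

if call_with_circuit "zigbee2mqtt" "curl -sf http://192.168.1.50:8080/api/devices"; then
  echo "[$(date +%H:%M:%S)] Zigbee bridge reachable"
else
  echo "[$(date +%H:%M:%S)] Skipping Zigbee tasks this run"
fi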

Fallback Endpoints

For critical data like weather, I have multiple API sources. If OpenWeatherMap is down or rate-limited, I fall back to wttr.in:

get_weather_data() {
  local primary="https://api.openweathermap.org/data/2.5/weather?q=MyCity&appid=$API_KEY"
  local fallback="https://wttr.in/MyCity?format=j1"
  
  if call_with_circuit "openweather" "curl -sf '$primary'"; then
    echo "[$(date +%H:%M:%S)] Got weather from primary"
    return 0
  fi
  
  echo "[$(date +%H:%M:%S)] Primary failed, trying fallback"
  
  if call_with_circuit "wttr" "curl -sf '$fallback'"; then
    echo "[$(date +%H:%M:%S)] Got weather from fallback"
    return 0
  fi
  
  echo "[$(date +%H:%M:%S)] All weather sources failed"
  return 1
}

This saved me during an OpenWeatherMap outage last month. My morning automation kept working because it silently switched to the backup source.

What Worked

Exponential backoff with jitter stopped my scripts from creating retry storms. When my Zigbee bridge crashes and restarts, the randomized delays mean requests trickle in instead of slamming it all at once.

Circuit breaker state files prevent wasted attempts. If my ESP32 sensor is offline, the circuit opens and subsequent scripts skip it immediately instead of waiting through timeouts.

Fallback endpoints made my automation more resilient. Weather data, in particular, has multiple free sources that work well enough for my use case.

Logging timestamps helped me debug patterns. I could see that my Zigbee bridge was crashing every morning at 6:15 AM (turns out a specific automation was overloading it).

What Didn't Work

Too-short circuit timeouts caused problems initially. I started with 60-second circuit breaks, but some services (like my Zigbee bridge after a crash) need 2-3 minutes to fully restart. I settled on 5 minutes as a reasonable default.
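
If I were redoing it, I'd read the open duration from the environment instead of hard-coding 300, so individual scripts could tune it. A minimal sketch (CIRCUIT_OPEN_SECONDS is a name I'm making up here; circuit_is_open would compare elapsed against it instead of the literal 300):

# Default to 5 minutes, overridable per script or per systemd unit
CIRCUIT_OPEN_SECONDS="${CIRCUIT_OPEN_SECONDS:-300}"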

MD5 hashing endpoint URLs for state files seemed clever but made debugging harder. I couldn't tell at a glance which circuit was open. I should have used sanitized endpoint names instead.
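
A sketch of what I'd use instead, squashing anything that isn't filename-safe (the tr character class is just one reasonable choice):

# Turn an endpoint name into a readable, filename-safe state file path
state_file_for() {
  local endpoint=$1
  local safe
  safe=$(printf '%s' "$endpoint" | tr -c 'A-Za-z0-9._-' '_')
  echo "$CIRCUIT_STATE_DIR/$safe"
}

state_file_for "http://192.168.1.50:8080/api/devices"
# -> /var/run/automation-circuits/http___192.168.1.50_8080_api_devices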

No cleanup of old state files meant /var/run/automation-circuits accumulated junk. I added a weekly cleanup job to remove files older than 24 hours.
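
The cleanup itself is a one-liner; roughly what the weekly job runs (using -mmin so "older than 24 hours" is exact):

# Delete circuit state files that haven't been touched in over 24 hours
find /var/run/automation-circuits -type f -mmin +1440 -delete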

Hardcoded retry counts were inflexible. Some endpoints need 2 tries, others need 5. I should have made max_attempts configurable per endpoint.
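
Something like this would have been enough, with a small lookup table of per-endpoint retry budgets (the names and numbers here are illustrative):

# Per-endpoint attempt counts; anything not listed falls back to 3
declare -A ENDPOINT_ATTEMPTS=(
  [zigbee2mqtt]=5
  [openweather]=2
)

attempts_for() {
  echo "${ENDPOINT_ATTEMPTS[$1]:-3}"
}

retry_with_jitter "$(attempts_for zigbee2mqtt)" "curl -sf http://192.168.1.50:8080/api/devices"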

Silent failures in fallback chains hid problems. If both primary and fallback failed, I had no visibility into why. I added explicit logging for each failure reason.

Key Takeaways

Circuit breakers in bash aren't elegant, but they work for home automation where you're dealing with unreliable local hardware and flaky APIs.

Exponential backoff with jitter is mandatory if you have multiple scripts hitting the same endpoints. Without it, you create synchronized retry storms.

State files in /var/run work fine for tracking circuit state across script executions. Just remember to clean them up periodically.

Fallback endpoints are worth setting up for critical data sources. Weather APIs, in particular, have plenty of free alternatives.

Log everything with timestamps. When debugging why your automation failed at 3 AM, you need to know exactly which endpoint was down and for how long.

The biggest win wasn't preventing failures—it was failing faster and more gracefully. My scripts now complete in seconds instead of hanging for minutes on dead endpoints.