Why I Built This
I run multiple AI models across different machines in my home lab. Some run locally on Proxmox VMs with GPU passthrough, others on a Mac Mini with Apple Silicon, and I occasionally fall back to cloud APIs when local resources are maxed out. The problem was simple: managing all these endpoints manually became exhausting.
Every time I wanted to query a model, I had to remember which machine was running what, check if it was available, and manually switch endpoints if something was down. I needed a single entry point that could route requests intelligently without me having to think about it.
What I Actually Built
I wrote a bash script that acts as a lightweight AI model router. It sits in front of my local Ollama instances and cloud API endpoints, handles load balancing, and automatically fails over when something breaks.
The core logic is straightforward:
- Maintain a list of backend endpoints (local Ollama servers, OpenAI-compatible APIs, etc.)
- Health check each endpoint periodically
- Route incoming requests based on availability and response time
- Fall back to the next available endpoint if one fails
I chose bash because I already had it running on the VM that handles my internal routing, and I didn't want to introduce another runtime dependency. The script is small enough to debug quickly and modify on the fly.
The Setup
My current configuration includes:
- Two Proxmox VMs running Ollama with different models loaded
- One Mac Mini running llama.cpp for Metal-optimized inference
- OpenAI API as a fallback for when local resources are busy
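Concretely, that maps to a short list of backend URLs in the script, something like this (hostnames and ports are illustrative, not the real ones):

# Backend endpoints the router knows about (addresses are examples).
BACKENDS=(
  "http://ollama-vm1.lan:11434"    # Proxmox VM 1, Ollama
  "http://ollama-vm2.lan:11434"    # Proxmox VM 2, Ollama
  "http://mac-mini.lan:8080"       # Mac Mini, llama.cpp server
  "https://api.openai.com"         # cloud fallback
)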
The router runs as a systemd service on a lightweight Alpine Linux VM. It exposes a single HTTP endpoint that accepts OpenAI-compatible API requests and forwards them to the appropriate backend.
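The HTTP plumbing is thin enough to sketch: socat accepts connections on the router's port and hands each one to a small wrapper, which strips the HTTP framing, feeds the JSON body to handle_request (shown in the next section), and wraps the result back up as a response. Treat this as an illustration rather than the exact setup; socat, the port, and the file names are all assumptions.

# Listener: one short-lived wrapper process per incoming connection.
socat TCP-LISTEN:8080,reuseaddr,fork EXEC:/usr/local/bin/router-wrapper.sh

And the wrapper itself:

#!/usr/bin/env bash
# router-wrapper.sh (illustrative): minimal HTTP framing around handle_request.
# Assumes router-functions.sh only defines functions when sourced.
source /usr/local/bin/router-functions.sh

content_length=0
# Read the request line and headers, stopping at the blank line that ends them.
while IFS=$'\r' read -r line _; do
  [ -z "$line" ] && break
  case "$line" in
    [Cc]ontent-[Ll]ength:*) content_length=$(echo "${line#*:}" | tr -d ' ') ;;
  esac
done

# Read exactly the JSON body, route it, and answer with a bare HTTP response.
body=$(head -c "$content_length")
response=$(printf '%s' "$body" | handle_request)
length=$(printf '%s' "$response" | wc -c)
printf 'HTTP/1.1 200 OK\r\nContent-Type: application/json\r\nContent-Length: %s\r\nConnection: close\r\n\r\n%s' "$length" "$response"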
Core Script Logic
The health check function probes a backend and records its response time; it runs against every backend every 30 seconds:
check_health() {
  local endpoint=$1
  local start=$(date +%s%N)
  # Healthy means the backend answers Ollama's /api/tags within 5 seconds.
  if curl -sf -m 5 "${endpoint}/api/tags" > /dev/null 2>&1; then
    local end=$(date +%s%N)
    local latency=$(( (end - start) / 1000000 ))   # nanoseconds -> milliseconds
    # Record "endpoint latency" space-separated; the URLs contain colons,
    # which would confuse a colon-delimited sort/cut in select_backend.
    echo "${endpoint} ${latency}" >> /tmp/router_health
  else
    # Unreachable or too slow: record a sentinel latency so it sorts last.
    echo "${endpoint} 9999" >> /tmp/router_health
  fi
}
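A driver loop along these lines calls it every 30 seconds (sketch only: BACKENDS as the endpoint list and the truncate-then-refill approach are assumptions, not lifted from the real script):

# Re-probe every backend each cycle. Truncating the file first keeps stale
# latency entries from shadowing a backend that has since gone down; there is
# a brief window while the file refills.
health_loop() {
  while true; do
    : > /tmp/router_health
    for backend in "${BACKENDS[@]}"; do
      check_health "$backend"
    done
    sleep 30
  done
}

# Started once in the background when the router comes up.
health_loop &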
The routing function selects the fastest available backend:
select_backend() {
  # Lowest recorded latency (second field) wins.
  sort -k2 -n /tmp/router_health | head -1 | cut -d' ' -f1
}
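For illustration (hosts and numbers made up), the health file might contain:

http://ollama-vm1.lan:11434 23
http://ollama-vm2.lan:11434 61
http://mac-mini.lan:8080 112
https://api.openai.com 9999

Here select_backend would print http://ollama-vm1.lan:11434, the entry with the lowest recorded latency.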
When a request comes in, the script forwards it to the selected backend and retries on failure:
handle_request() {
  local backend request response
  backend=$(select_backend)
  # Capture the JSON body from stdin once so the same payload can be resent on retry.
  request=$(cat)
  # Assign separately from the local declaration so $? reflects curl, not local.
  response=$(curl -sf -X POST "${backend}/api/generate" \
    -H "Content-Type: application/json" \
    -d "$request")
  if [ $? -ne 0 ]; then
    # Mark this backend as unhealthy by overwriting its health entry, then
    # retry once against the next-best backend.
    sed -i "s|^${backend} .*|${backend} 9999|" /tmp/router_health
    backend=$(select_backend)
    response=$(curl -sf -X POST "${backend}/api/generate" \
      -H "Content-Type: application/json" \
      -d "$request")
  fi
  echo "$response"
}
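From a client's point of view the router looks like any single backend; an example call (router address and model name are placeholders):

# The router picks the backend and retries against the next-best one on failure.
curl -s -X POST http://router.lan:8080/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3", "prompt": "Why is the sky blue?", "stream": false}'

The same call works whether the request lands on a Proxmox VM, the Mac Mini, or the cloud fallback.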
What Worked
The script has been running for three months without manual intervention. It correctly routes requests to the fastest available backend and fails over when one goes down (usually due to model loading or VM maintenance).
Using bash kept the deployment simple. I can SSH into the router VM, edit the script, and restart the service in under a minute. No build steps, no dependency conflicts.
The health check approach is crude but effective. By recording response times, the router naturally prefers faster backends without needing complex load balancing algorithms.
Fallback to OpenAI works transparently. When all local backends are unavailable, requests automatically route to the cloud API. I monitor this through logs to track when I'm burning API credits versus using local compute.
What Didn't Work
Initial attempts to use round-robin load balancing failed because different models have vastly different response times. A simple rotation meant slow requests would pile up on already-busy backends.
I tried implementing request queuing to prevent overloading backends, but bash isn't great at concurrent request handling. The queue logic became messy fast, and I removed it in favor of just letting requests fail and retry.
The health check interval is a compromise: 30 seconds means the router can keep routing to a dead backend for up to half a minute before detecting the failure. I experimented with shorter intervals, but the constant health check traffic started affecting model response times on lower-spec VMs.
Streaming responses don't work well through this setup. The bash script buffers the entire response before forwarding it, which breaks the streaming experience. For now, I route streaming requests directly to backends, bypassing the router.
Cloud API Cost Tracking
I added basic logging to track when requests hit the OpenAI fallback, but I don't have good cost visibility yet. The logs show request counts, but I have to manually correlate them with my OpenAI dashboard to see actual spending. This is something I need to improve.
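The logging is basic by design; the shape of it is roughly this (the log path and URL check are illustrative, and the real code may differ):

# In handle_request, after a backend has been chosen (sketch): append a
# timestamped line whenever the request is about to hit the cloud API.
case "$backend" in
  https://api.openai.com*)
    echo "$(date '+%Y-%m-%dT%H:%M:%S') fallback_to_openai" >> /var/log/ai-router/fallback.log
    ;;
esac

# Counting today's fallbacks is then a one-liner:
grep -c "^$(date '+%Y-%m-%d')" /var/log/ai-router/fallback.log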
Key Takeaways
A simple bash-based router solved my immediate problem without adding complexity. It's not elegant, but it works reliably for my use case.
Health checks based on response time are more useful than binary up/down checks. Knowing which backend is faster matters more than just knowing it's alive.
Automatic failover is worth the effort. Not having to manually switch endpoints when a VM goes down for maintenance has saved me significant time and frustration.
Bash has real limitations for this kind of work. If I need better streaming support or more sophisticated load balancing, I'll likely rewrite this in Go or Python. But for now, the simplicity is worth the trade-offs.
The biggest win is having a single endpoint for all my AI model interactions. Whether I'm testing a prompt locally or running a batch job that needs cloud resources, the same API call works everywhere.