Why I Built This
I run multiple AI models across different machines in my home lab. Some run locally on Proxmox VMs with GPU passthrough, others on a Mac Mini with Apple Silicon, and I occasionally fall back to cloud APIs when local resources are maxed out. The problem was simple: managing all these endpoints manually became exhausting.
Every time I wanted to query a model, I had to remember which machine was running what, check if it was available, and manually switch endpoints if something was down. I needed a single entry point that could route requests intelligently without me having to think about it.
What I Actually Built
I wrote a bash script that acts as a lightweight AI model router. It sits in front of my local Ollama instances and cloud API endpoints, handles load balancing, and automatically fails over when something breaks.
The core logic is straightforward:
- Maintain a list of backend endpoints (local Ollama servers, OpenAI-compatible APIs, etc.)
- Health check each endpoint periodically
- Route incoming requests based on availability and response time
- Fall back to the next available endpoint if one fails
I chose bash because I already had it running on the VM that handles my internal routing, and I didn't want to introduce another runtime dependency. The script is small enough to debug quickly and modify on the fly.
The Setup
My current configuration includes:
- Two Proxmox VMs running Ollama with different models loaded
- One Mac Mini running llama.cpp for Metal-optimized inference
- OpenAI API as a fallback for when local resources are busy
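Concretely, that maps to a short list of backend URLs in the script, something like this (hostnames and ports are illustrative, not the real ones):

# Backend endpoints the router knows about (addresses are examples).
BACKENDS=(
  "http://ollama-vm1.lan:11434"    # Proxmox VM 1, Ollama
  "http://ollama-vm2.lan:11434"    # Proxmox VM 2, Ollama
  "http://mac-mini.lan:8080"       # Mac Mini, llama.cpp server
  "https://api.openai.com"         # cloud fallback
)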
The router runs as a systemd service on a lightweight Alpine Linux VM. It exposes a single HTTP endpoint that accepts OpenAI-compatible API requests and forwards them to the appropriate backend.
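The HTTP plumbing is thin enough to sketch: socat accepts connections on the router's port and hands each one to a small wrapper, which strips the HTTP framing, feeds the JSON body to handle_request (shown in the next section), and wraps the result back up as a response. Treat this as an illustration rather than the exact setup; socat, the port, and the file names are all assumptions.

# Listener: one short-lived wrapper process per incoming connection.
socat TCP-LISTEN:8080,reuseaddr,fork EXEC:/usr/local/bin/router-wrapper.sh

And the wrapper itself:

#!/usr/bin/env bash
# router-wrapper.sh (illustrative): minimal HTTP framing around handle_request.
# Assumes router-functions.sh only defines functions when sourced.
source /usr/local/bin/router-functions.sh

content_length=0
# Read the request line and headers, stopping at the blank line that ends them.
while IFS=$'\r' read -r line _; do
  [ -z "$line" ] && break
  case "$line" in
    [Cc]ontent-[Ll]ength:*) content_length=$(echo "${line#*:}" | tr -d ' ') ;;
  esac
done

# Read exactly the JSON body, route it, and answer with a bare HTTP response.
body=$(head -c "$content_length")
response=$(printf '%s' "$body" | handle_request)
length=$(printf '%s' "$response" | wc -c)
printf 'HTTP/1.1 200 OK\r\nContent-Type: application/json\r\nContent-Length: %s\r\nConnection: close\r\n\r\n%s' "$length" "$response"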
Core Script Logic
The health check function probes a backend and records its response time; it runs against every backend every 30 seconds:
check_health() {
  local endpoint=$1
  local start=$(date +%s%N)
  # Healthy means the backend answers Ollama's /api/tags within 5 seconds.
  if curl -sf -m 5 "${endpoint}/api/tags" > /dev/null 2>&1; then
    local end=$(date +%s%N)
    local latency=$(( (end - start) / 1000000 ))   # nanoseconds -> milliseconds
    # Record "endpoint latency" space-separated; the URLs contain colons,
    # which would confuse a colon-delimited sort/cut in select_backend.
    echo "${endpoint} ${latency}" >> /tmp/router_health
  else
    # Unreachable or too slow: record a sentinel latency so it sorts last.
    echo "${endpoint} 9999" >> /tmp/router_health
  fi
}
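A driver loop along these lines calls it every 30 seconds (sketch only: BACKENDS as the endpoint list and the truncate-then-refill approach are assumptions, not lifted from the real script):

# Re-probe every backend each cycle. Truncating the file first keeps stale
# latency entries from shadowing a backend that has since gone down; there is
# a brief window while the file refills.
health_loop() {
  while true; do
    : > /tmp/router_health
    for backend in "${BACKENDS[@]}"; do
      check_health "$backend"
    done
    sleep 30
  done
}

# Started once in the background when the router comes up.
health_loop &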
The routing function selects the fastest available backend:
select_backend() {
  # Lowest recorded latency (second field) wins.
  sort -k2 -n /tmp/router_health | head -1 | cut -d' ' -f1
}
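For illustration (hosts and numbers made up), the health file might contain:

http://ollama-vm1.lan:11434 23
http://ollama-vm2.lan:11434 61
http://mac-mini.lan:8080 112
https://api.openai.com 9999

Here select_backend would print http://ollama-vm1.lan:11434, the entry with the lowest recorded latency.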
When a request comes in, the script forwards it to the selected backend and retries on failure:
handle_request() {
  local backend request response
  backend=$(select_backend)
  # Capture the JSON body from stdin once so the same payload can be resent on retry.
  request=$(cat)
  # Assign separately from the local declaration so $? reflects curl, not local.
  response=$(curl -sf -X POST "${backend}/api/generate" \
    -H "Content-Type: application/json" \
    -d "$request")
  if [ $? -ne 0 ]; then
    # Mark this backend as unhealthy by overwriting its health entry, then
    # retry once against the next-best backend.
    sed -i "s|^${backend} .*|${backend} 9999|" /tmp/router_health
    backend=$(select_backend)
    response=$(curl -sf -X POST "${backend}/api/generate" \
      -H "Content-Type: application/json" \
      -d "$request")
  fi
  echo "$response"
}
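From a client's point of view the router looks like any single backend; an example call (router address and model name are placeholders):

# The router picks the backend and retries against the next-best one on failure.
curl -s -X POST http://router.lan:8080/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3", "prompt": "Why is the sky blue?", "stream": false}'

The same call works whether the request lands on a Proxmox VM, the Mac Mini, or the cloud fallback.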
What Worked
The script has been running for three months without manual intervention. It correctly routes requests to the fastest available backend and fails over when one goes down (usually due to model loading or VM maintenance).
Using bash kept the deployment simple. I can SSH into the router VM, edit the script, and restart the service in under a minute. No build steps, no dependency conflicts.
The health check approach is crude but effective. By recording response times, the router naturally prefers faster backends without needing complex load balancing algorithms.
Fallback to OpenAI works transparently. When all local backends are unavailable, requests automatically route to the cloud API. I monitor this through logs to track when I'm burning API credits versus using local compute.
What Didn't Work
Initial attempts to use round-robin load balancing failed because different models have vastly different response times. A simple rotation meant slow requests would pile up on already-busy backends.
I tried implementing request queuing to prevent overloading backends, but bash isn't great at concurrent request handling. The queue logic became messy fast, and I removed it in favor of just letting requests fail and retry.
The health check interval is a compromise: 30 seconds means the router can keep routing to a dead backend for up to half a minute before detecting the failure. I experimented with shorter intervals, but the constant health check traffic started affecting model response times on lower-spec VMs.
Streaming responses don't work well through this setup. The bash script buffers the entire response before forwarding it, which breaks the streaming experience. For now, I route streaming requests directly to backends, bypassing the router.
Cloud API Cost Tracking
I added basic logging to track when requests hit the OpenAI fallback, but I don't have good cost visibility yet. The logs show request counts, but I have to manually correlate them with my OpenAI dashboard to see actual spending. This is something I need to improve.
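The logging is basic by design; the shape of it is roughly this (the log path and URL check are illustrative, and the real code may differ):

# In handle_request, after a backend has been chosen (sketch): append a
# timestamped line whenever the request is about to hit the cloud API.
case "$backend" in
  https://api.openai.com*)
    echo "$(date '+%Y-%m-%dT%H:%M:%S') fallback_to_openai" >> /var/log/ai-router/fallback.log
    ;;
esac

# Counting today's fallbacks is then a one-liner:
grep -c "^$(date '+%Y-%m-%d')" /var/log/ai-router/fallback.log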
Key Takeaways
A simple bash-based router solved my immediate problem without adding complexity. It's not elegant, but it works reliably for my use case.
Health checks based on response time are more useful than binary up/down checks. Knowing which backend is faster matters more than just knowing it's alive.
Automatic failover is worth the effort. Not having to manually switch endpoints when a VM goes down for maintenance has saved me significant time and frustration.
Bash has real limitations for this kind of work. If I need better streaming support or more sophisticated load balancing, I'll likely rewrite this in Go or Python. But for now, the simplicity is worth the trade-offs.
The biggest win is having a single endpoint for all my AI model interactions. Whether I'm testing a prompt locally or running a batch job that needs cloud resources, the same API call works everywhere.