## Why I Built This
I run AI workloads on my home lab—mostly local models through Ollama for privacy-sensitive tasks and cost control. But local inference has limits. Models fail, responses time out, or I need capabilities my hardware can’t handle. I didn’t want to choose between “always local” and “always cloud.” I wanted both, with intelligent routing.
The problem: managing multiple model endpoints manually is tedious. Different APIs, different authentication methods, different retry logic. I needed a single interface that could route requests to the right model based on availability, cost, and capability—without changing my application code every time.
That’s why I set up LiteLLM Proxy. It sits between my applications and all my model providers, handling load balancing, fallbacks, and routing automatically.
## My Real Setup
I run LiteLLM Proxy as a Docker container on my Proxmox cluster. It fronts:
- Local Ollama instance (llama3.2, mistral, qwen2.5)
- OpenAI API (gpt-4o-mini for when I need reliability)
- Anthropic Claude (claude-3-5-sonnet for complex reasoning)
My applications—n8n workflows, custom Python scripts, and a few internal tools—all point to the LiteLLM endpoint. They don’t know or care which backend actually handles the request.
## The Docker Setup
I use Docker Compose to keep the proxy configuration versioned and reproducible:
```yaml
version: '3.8'
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    container_name: litellm-proxy
    ports:
      - "4000:4000"
    volumes:
      - ./config.yaml:/app/config.yaml
    environment:
      - LITELLM_MASTER_KEY=${LITELLM_MASTER_KEY}
    command: --config /app/config.yaml --detailed_debug
    restart: unless-stopped
```
The `LITELLM_MASTER_KEY` is stored in a `.env` file that’s excluded from version control. This key is what my applications use to authenticate with the proxy.
## Configuration File
The real work happens in `config.yaml`. Here’s my actual configuration with sensitive keys removed:
```yaml
model_list:
  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
      rpm: 500
  - model_name: claude-3-5-sonnet
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
      api_key: os.environ/ANTHROPIC_API_KEY
      rpm: 50
  - model_name: llama3.2
    litellm_params:
      model: ollama/llama3.2
      api_base: http://192.168.1.50:11434
  - model_name: qwen2.5
    litellm_params:
      model: ollama/qwen2.5:14b
      api_base: http://192.168.1.50:11434

router_settings:
  routing_strategy: usage-based-routing-v2
  allowed_fails: 3
  cooldown_time: 60
  num_retries: 2
  timeout: 300

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: "postgresql://litellm:password@postgres:5432/litellm"
```
A few things worth explaining:
- `rpm` limits prevent me from accidentally burning through API quotas
- `allowed_fails` and `cooldown_time` temporarily disable failing backends
- `usage-based-routing-v2` balances load across healthy endpoints
- The PostgreSQL database tracks usage, costs, and request logs
## How Load Balancing Actually Works
I set up model groups with multiple backends. For general chat tasks, I created a “chat” model that tries local first, then falls back to cloud:
```yaml
model_list:
  - model_name: chat
    litellm_params:
      model: ollama/llama3.2
      api_base: http://192.168.1.50:11434
    priority: 1
  - model_name: chat
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
    priority: 2
```
Priority determines the order. LiteLLM tries priority 1 (local Ollama) first. If that fails or times out, it automatically routes to priority 2 (OpenAI).
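The failover logic is easy to reason about if you sketch it out. This is a simplified Python model of priority-ordered fallback, not LiteLLM’s actual routing code; the deployment names and the fake transport are illustrative:

```python
# Simplified sketch of priority-ordered failover. Illustrative only,
# not LiteLLM internals.
def route_with_fallback(deployments, send_request):
    """Try deployments in priority order; move to the next one on failure."""
    last_error = None
    for dep in sorted(deployments, key=lambda d: d["priority"]):
        try:
            return dep["name"], send_request(dep)
        except Exception as exc:  # timeout, connection refused, etc.
            last_error = exc
    raise RuntimeError(f"all deployments failed: {last_error}")

deployments = [
    {"name": "ollama/llama3.2", "priority": 1},
    {"name": "openai/gpt-4o-mini", "priority": 2},
]

def fake_send(dep):
    # Simulate the local backend being down, like my Ollama crash.
    if dep["name"].startswith("ollama/"):
        raise ConnectionError("Ollama unreachable")
    return "response from cloud"

backend, reply = route_with_fallback(deployments, fake_send)
print(backend)  # -> openai/gpt-4o-mini
```

The key property is that the caller only sees the final result, never the intermediate failure, which is exactly how the proxy behaves from my applications’ point of view.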
My n8n workflows just call the “chat” model. They have no idea which backend handled the request. This saved me when Ollama crashed during a system update—requests automatically failed over to OpenAI without breaking any workflows.
## Testing Fallback Behavior
I deliberately stopped the Ollama container to verify fallback worked:
```bash
docker stop ollama
```
Then sent a test request through LiteLLM:
```bash
curl -X POST http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "chat",
    "messages": [{"role": "user", "content": "test fallback"}]
  }'
```
The proxy logged the Ollama failure and routed to OpenAI within 2 seconds. No manual intervention needed.
## What Worked
Cost visibility. The PostgreSQL backend tracks every request with model, tokens, and estimated cost. I run a simple query weekly to see where money goes:
```sql
SELECT model, COUNT(*), SUM(total_tokens), SUM(response_cost)
FROM litellm_spend_logs
WHERE created_at > NOW() - INTERVAL '7 days'
GROUP BY model;
```
This showed me I was using Claude for tasks that llama3.2 could handle locally. I adjusted my application logic and cut cloud costs by 40%.
Unified interface. My Python scripts use the OpenAI SDK pointed at the LiteLLM endpoint. Switching backends requires zero code changes—just update the config and restart the proxy.
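Under the hood, “pointing the OpenAI SDK at LiteLLM” just means the same wire format with a different base URL. Here’s a stdlib-only sketch of the request my scripts send; the URL and key are placeholders (in practice they come from environment variables, and with the official SDK you’d pass the same values as `base_url` and `api_key`):

```python
import json
import urllib.request

# Placeholders; real values come from environment variables.
LITELLM_BASE = "http://localhost:4000/v1"
MASTER_KEY = "sk-example"

def build_chat_request(model, user_content):
    """Build an OpenAI-compatible chat completions request for the proxy."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_content}],
    }
    return urllib.request.Request(
        f"{LITELLM_BASE}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {MASTER_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_chat_request("chat", "hello")
print(req.full_url)  # -> http://localhost:4000/v1/chat/completions
```

Because the model name here is the proxy-side group (“chat”), nothing in this request changes when I swap the backend behind it.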
Automatic retries. Network blips used to break my automation workflows. LiteLLM’s retry logic with exponential backoff handles transient failures without surfacing errors to the application.
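The retry pattern is worth internalizing even if the proxy does it for you. A minimal sketch of exponential backoff, mirroring the `num_retries: 2` in my router settings (the helper is my illustration, not LiteLLM’s implementation):

```python
import time

def with_retries(fn, num_retries=2, base_delay=0.1):
    """Retry transient failures with exponential backoff.
    Mirrors the num_retries: 2 router setting; illustrative only."""
    for attempt in range(num_retries + 1):
        try:
            return fn()
        except ConnectionError:
            if attempt == num_retries:
                raise  # out of retries, surface the error
            time.sleep(base_delay * (2 ** attempt))  # 0.1s, 0.2s, 0.4s, ...

calls = {"count": 0}

def flaky():
    # Fails twice (a simulated network blip), then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient blip")
    return "ok"

print(with_retries(flaky))  # -> ok
```

Note that only transient errors are retried; a hard failure after the final attempt still propagates, which is when the priority fallback takes over.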
Rate limiting. I set conservative RPM limits on cloud providers. When I accidentally wrote a loop that hammered the API, LiteLLM queued requests instead of letting me blow through my quota in minutes.
## What Didn’t Work
Initial timeout values were wrong. I set global timeout to 30 seconds. Local Ollama models on my hardware sometimes need 45-60 seconds for long context. Requests failed unnecessarily. I bumped timeout to 300 seconds and added model-specific overrides:
```yaml
- model_name: llama3.2
  litellm_params:
    model: ollama/llama3.2
    api_base: http://192.168.1.50:11434
    timeout: 120
```
Database filled up faster than expected. Every request gets logged. After a month, the PostgreSQL database hit 2GB. I wasn’t prepared for that growth. I wrote a cleanup script to delete logs older than 30 days and scheduled it in Cronicle.
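The cleanup itself is a one-line DELETE; my actual script just wraps it with connection handling. A sketch of the statement builder (the table name matches the spend-logs table in my queries above; check which log tables your LiteLLM version actually writes):

```python
def build_cleanup_sql(table, retention_days=30):
    """Generate the retention DELETE that the Cronicle job runs.
    Table name is assumed from my spend queries; verify against
    your own schema before running."""
    return (
        f"DELETE FROM {table} "
        f"WHERE created_at < NOW() - INTERVAL '{retention_days} days';"
    )

print(build_cleanup_sql("litellm_spend_logs"))
```

A `VACUUM` afterwards helps PostgreSQL actually reclaim the disk space.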
Streaming responses had quirks. Some of my applications use streaming for real-time output. LiteLLM supports this, but I found that fallback doesn’t work mid-stream. If Ollama starts streaming then crashes, the request fails—it doesn’t switch to OpenAI. This makes sense technically, but I had to adjust my expectations. For critical streaming use cases, I now point directly to cloud providers.
Model name confusion. I initially named my model groups generically (“fast”, “smart”). This became unreadable in logs. I renamed them to reflect actual use cases (“code-review”, “summarization”, “chat”). Much clearer when debugging.
## Integration with n8n
Most of my AI automation runs through n8n. I use the HTTP Request node to call LiteLLM instead of the built-in OpenAI node. This gives me flexibility to change backends without touching workflows.
Example n8n HTTP Request configuration:
```
URL: http://litellm-proxy:4000/v1/chat/completions
Method: POST
Authentication: Generic Credential Type
Header Parameters:
  Authorization: Bearer {{ $env.LITELLM_MASTER_KEY }}
Body:
  {
    "model": "chat",
    "messages": [
      {"role": "user", "content": "{{ $json.input }}"}
    ]
  }
```
I store `LITELLM_MASTER_KEY` as an n8n environment variable. Workflows reference it without hardcoding credentials.
## Cost Tracking and Budget Alerts
I added a simple budget alert by querying the spend logs daily. A Cronicle job runs this SQL:
```sql
SELECT SUM(response_cost) AS daily_cost
FROM litellm_spend_logs
WHERE created_at > CURRENT_DATE;
If daily cost exceeds $5, it sends me a notification through ntfy. This caught a runaway process that was retrying failed requests in a loop.
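The alerting logic is trivial once the query result is in hand. A sketch of the decision step (the $5 threshold matches my Cronicle job; the message format is my own convention, and in practice it goes out as a plain HTTP POST to an ntfy topic):

```python
def budget_alert(daily_cost, threshold=5.0):
    """Return an alert message when daily spend crosses the threshold,
    otherwise None. Threshold mirrors my $5/day limit."""
    if daily_cost > threshold:
        return f"LiteLLM daily spend ${daily_cost:.2f} exceeds ${threshold:.2f} budget"
    return None

print(budget_alert(7.25))  # -> alert message
print(budget_alert(1.10))  # -> None
```

Keeping the threshold check in the job rather than in every workflow means one place to tune when my usage patterns change.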
## Monitoring and Debugging
LiteLLM exposes a `/health` endpoint that I monitor with Uptime Kuma. It checks every 60 seconds and alerts if the proxy goes down.
For debugging, I tail the Docker logs:
```bash
docker logs -f litellm-proxy
```
The `--detailed_debug` flag in my Docker command logs every request and response. This is verbose but invaluable when troubleshooting routing issues.
## Key Takeaways
Fallback is not magic. It works for request-level failures, not mid-stream. Design your applications accordingly.
Start with conservative timeouts and rate limits. You can always relax them. Starting too aggressive causes mysterious failures that are hard to debug.
Log retention needs a strategy. LiteLLM logs everything by default. Decide early how long you need logs and automate cleanup.
Model naming matters. Use names that make sense in logs and cost reports, not abstract labels.
The proxy is a single point of failure. If LiteLLM goes down, all AI requests fail. I run it on a reliable VM with automatic restarts, but I’ve accepted this risk for the convenience it provides.
Cost visibility changes behavior. Seeing actual spend per model made me rethink which tasks needed cloud APIs. Most don’t.
This setup has been running for three months. I’ve had two unplanned outages—both from me breaking the config file. The automatic fallback has saved me at least a dozen times when Ollama had issues. For my use case, the complexity is worth it.