
Building Multi-Model AI Pipelines with LiteLLM Proxy: Load Balancing Requests Between Local Ollama and Cloud APIs with Automatic Fallback

Why I Built This

I run AI workloads on my home lab—mostly local models through Ollama for privacy-sensitive tasks and cost control. But local inference has limits. Models fail, responses time out, or I need capabilities my hardware can’t handle. I didn’t want to choose between “always local” or “always cloud.” I wanted both, with intelligent routing.

The problem: managing multiple model endpoints manually is tedious. Different APIs, different authentication methods, different retry logic. I needed a single interface that could route requests to the right model based on availability, cost, and capability—without changing my application code every time.

That’s why I set up LiteLLM Proxy. It sits between my applications and all my model providers, handling load balancing, fallbacks, and routing automatically.

My Real Setup

I run LiteLLM Proxy as a Docker container on my Proxmox cluster. It fronts:

  • Local Ollama instance (llama3.2, mistral, qwen2.5)
  • OpenAI API (gpt-4o-mini for when I need reliability)
  • Anthropic Claude (claude-3-5-sonnet for complex reasoning)

My applications—n8n workflows, custom Python scripts, and a few internal tools—all point to the LiteLLM endpoint. They don’t know or care which backend actually handles the request.
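To make that concrete, here’s a minimal sketch of what an application-side request looks like, using only Python’s standard library. The base URL, the “chat” model group, and the helper names are illustrative—the point is that the proxy speaks the standard OpenAI chat-completions format, so callers need nothing LiteLLM-specific:

```python
# Minimal sketch: calling the LiteLLM proxy's OpenAI-compatible endpoint
# with only the standard library. LITELLM_MASTER_KEY comes from the
# environment; the URL and model group name are examples.
import json
import os
import urllib.request


def build_request(prompt: str, model: str = "chat",
                  base_url: str = "http://localhost:4000") -> urllib.request.Request:
    """Build a chat-completions request for the LiteLLM proxy."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {os.environ.get('LITELLM_MASTER_KEY', '')}",
            "Content-Type": "application/json",
        },
    )


def ask(prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

Calling `ask("summarize this")` hits whichever backend the proxy routes to; the client code never changes.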

The Docker Setup

I use Docker Compose to keep the proxy configuration versioned and reproducible:

version: '3.8'

services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    container_name: litellm-proxy
    ports:
      - "4000:4000"
    volumes:
      - ./config.yaml:/app/config.yaml
    environment:
      - LITELLM_MASTER_KEY=${LITELLM_MASTER_KEY}
    command: --config /app/config.yaml --detailed_debug
    restart: unless-stopped

The LITELLM_MASTER_KEY is stored in a .env file that’s excluded from version control. This key is what my applications use to authenticate with the proxy.
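For reference, the .env is nothing more than the keys the stack needs (values below are placeholders). One gotcha worth flagging: the compose file above only forwards LITELLM_MASTER_KEY into the container, so any provider keys referenced via os.environ/ in config.yaml also need to reach the container—for example through additional environment entries or an env_file.

```
# .env — excluded from version control via .gitignore
LITELLM_MASTER_KEY=sk-1234-change-me
OPENAI_API_KEY=sk-placeholder
ANTHROPIC_API_KEY=sk-ant-placeholder
```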

Configuration File

The real work happens in config.yaml. Here’s my actual configuration with sensitive keys removed:

model_list:
  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
      rpm: 500
      
  - model_name: claude-3-5-sonnet
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
      api_key: os.environ/ANTHROPIC_API_KEY
      rpm: 50
      
  - model_name: llama3.2
    litellm_params:
      model: ollama/llama3.2
      api_base: http://192.168.1.50:11434
      
  - model_name: qwen2.5
    litellm_params:
      model: ollama/qwen2.5:14b
      api_base: http://192.168.1.50:11434

router_settings:
  routing_strategy: usage-based-routing-v2
  allowed_fails: 3
  cooldown_time: 60
  num_retries: 2
  timeout: 300
  
general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: "postgresql://litellm:password@postgres:5432/litellm"

A few things worth explaining:

  • rpm limits prevent me from accidentally burning through API quotas
  • allowed_fails and cooldown_time temporarily disable failing backends
  • usage-based-routing-v2 balances load across healthy endpoints
  • The PostgreSQL database tracks usage, costs, and request logs

How Load Balancing Actually Works

I set up model groups with multiple backends. For general chat tasks, I created a “chat” model that tries local first, then falls back to cloud:

model_list:
  - model_name: chat
    litellm_params:
      model: ollama/llama3.2
      api_base: http://192.168.1.50:11434
      priority: 1
      
  - model_name: chat
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
      priority: 2

Priority determines the order. LiteLLM tries priority 1 (local Ollama) first. If that fails or times out, it automatically routes to priority 2 (OpenAI).

My n8n workflows just call the “chat” model. They have no idea which backend handled the request. This saved me when Ollama crashed during a system update—requests automatically failed over to OpenAI without breaking any workflows.

Testing Fallback Behavior

I deliberately stopped the Ollama container to verify fallback worked:

docker stop ollama

Then sent a test request through LiteLLM:

curl -X POST http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "chat",
    "messages": [{"role": "user", "content": "test fallback"}]
  }'

The proxy logged the Ollama failure and routed to OpenAI within 2 seconds. No manual intervention needed.

What Worked

Cost visibility. The PostgreSQL backend tracks every request with model, tokens, and estimated cost. I run a simple query weekly to see where money goes:

SELECT model, COUNT(*), SUM(total_tokens), SUM(response_cost) 
FROM litellm_spend_logs 
WHERE created_at > NOW() - INTERVAL '7 days' 
GROUP BY model;

This showed me I was using Claude for tasks that llama3.2 could handle locally. I adjusted my application logic and cut cloud costs by 40%.

Unified interface. My Python scripts use the OpenAI SDK pointed at the LiteLLM endpoint. Switching backends requires zero code changes—just update the config and restart the proxy.

Automatic retries. Network blips used to break my automation workflows. LiteLLM’s retry logic with exponential backoff handles transient failures without surfacing errors to the application.

Rate limiting. I set conservative RPM limits on cloud providers. When I accidentally wrote a loop that hammered the API, LiteLLM queued requests instead of letting me blow through my quota in minutes.

What Didn’t Work

Initial timeout values were wrong. I set the global timeout to 30 seconds, but local Ollama models on my hardware sometimes need 45-60 seconds for long contexts, so requests failed unnecessarily. I bumped the global timeout to 300 seconds and added model-specific overrides:

- model_name: llama3.2
  litellm_params:
    model: ollama/llama3.2
    api_base: http://192.168.1.50:11434
    timeout: 120

Database filled up faster than expected. Every request gets logged, and after a month the PostgreSQL database hit 2GB. I wasn’t prepared for that growth, so I wrote a cleanup script that deletes logs older than 30 days and scheduled it in Cronicle.
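The cleanup itself is a one-statement delete against the same spend-log table the cost queries use (a sketch—run it from a Cronicle job via psql, and adjust the retention window to taste):

```sql
-- Prune LiteLLM spend logs older than 30 days.
DELETE FROM litellm_spend_logs
WHERE created_at < NOW() - INTERVAL '30 days';
```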

Streaming responses had quirks. Some of my applications use streaming for real-time output. LiteLLM supports this, but I found that fallback doesn’t work mid-stream. If Ollama starts streaming then crashes, the request fails—it doesn’t switch to OpenAI. This makes sense technically, but I had to adjust my expectations. For critical streaming use cases, I now point directly to cloud providers.
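The failure mode is easier to see if you look at the wire format: streamed chat completions arrive as server-sent events, one `data:` line per token delta, so once the upstream dies mid-stream there is no clean point for the proxy to restart from with a different backend. A sketch of a parser for those lines (the helper name is mine, not LiteLLM’s):

```python
# Parse one SSE line from an OpenAI-compatible streaming response.
# Returns the text delta, or None for keepalives and the [DONE] sentinel.
import json
from typing import Optional


def parse_sse_line(line: str) -> Optional[str]:
    if not line.startswith("data: "):
        return None  # comments, keepalives, blank lines
    payload = line[len("data: "):].strip()
    if payload == "[DONE]":
        return None  # end-of-stream sentinel
    chunk = json.loads(payload)
    return chunk["choices"][0]["delta"].get("content")
```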

Model name confusion. I initially named my model groups generically (“fast”, “smart”). This became unreadable in logs. I renamed them to reflect actual use cases (“code-review”, “summarization”, “chat”). Much clearer when debugging.

Integration with n8n

Most of my AI automation runs through n8n. I use the HTTP Request node to call LiteLLM instead of the built-in OpenAI node. This gives me flexibility to change backends without touching workflows.

Example n8n HTTP Request configuration:

URL: http://litellm-proxy:4000/v1/chat/completions
Method: POST
Authentication: Generic Credential Type
Header Parameters:
  Authorization: Bearer {{ $env.LITELLM_MASTER_KEY }}
Body:
{
  "model": "chat",
  "messages": [
    {"role": "user", "content": "{{ $json.input }}"}
  ]
}

I store LITELLM_MASTER_KEY as an n8n environment variable. Workflows reference it without hardcoding credentials.

Cost Tracking and Budget Alerts

I added a simple budget alert by querying the spend logs daily. A Cronicle job runs this SQL:

SELECT SUM(response_cost) as daily_cost 
FROM litellm_spend_logs 
WHERE created_at > CURRENT_DATE;

If daily cost exceeds $5, it sends me a notification through ntfy. This caught a runaway process that was retrying failed requests in a loop.
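The glue around that query is small. A sketch of the alert logic, assuming the daily total has already been fetched (e.g. piped from `psql -t` by the Cronicle job) and using ntfy’s plain HTTP POST publish API—the topic URL and the $5 limit here are examples, not my real values:

```python
# Hypothetical daily budget alert: check the spend total against a
# threshold and publish a plain-text message to an ntfy topic.
import urllib.request

DAILY_LIMIT_USD = 5.0
NTFY_TOPIC = "https://ntfy.sh/litellm-alerts-example"  # placeholder topic


def over_budget(daily_cost: float, limit: float = DAILY_LIMIT_USD) -> bool:
    """True once today's spend crosses the alert threshold."""
    return daily_cost > limit


def notify(daily_cost: float) -> None:
    """POST a plain-text alert to ntfy (its simplest publish API)."""
    msg = f"LiteLLM daily spend ${daily_cost:.2f} exceeded ${DAILY_LIMIT_USD:.2f}"
    req = urllib.request.Request(NTFY_TOPIC, data=msg.encode(), method="POST")
    urllib.request.urlopen(req)
```

The Cronicle job just reads the query result and calls `notify(cost)` when `over_budget(cost)` is true.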

Monitoring and Debugging

LiteLLM exposes a /health endpoint that I monitor with Uptime Kuma. It checks every 60 seconds and alerts if the proxy goes down.

For debugging, I tail the Docker logs:

docker logs -f litellm-proxy

The --detailed_debug flag in my Docker command logs every request and response. This is verbose but invaluable when troubleshooting routing issues.

Key Takeaways

Fallback is not magic. It works for request-level failures, not mid-stream. Design your applications accordingly.

Start with conservative timeouts and rate limits. You can always relax them. Starting too aggressive causes mysterious failures that are hard to debug.

Log retention needs a strategy. LiteLLM logs everything by default. Decide early how long you need logs and automate cleanup.

Model naming matters. Use names that make sense in logs and cost reports, not abstract labels.

The proxy is a single point of failure. If LiteLLM goes down, all AI requests fail. I run it on a reliable VM with automatic restarts, but I’ve accepted this risk for the convenience it provides.

Cost visibility changes behavior. Seeing actual spend per model made me rethink which tasks needed cloud APIs. Most don’t.

This setup has been running for three months. I’ve had two unplanned outages—both from me breaking the config file. The automatic fallback has saved me at least a dozen times when Ollama had issues. For my use case, the complexity is worth it.
