Why I Built This
I run multiple Ollama models locally on my Proxmox server—some small and fast, others larger and more capable. The problem I kept hitting was simple: a prompt would work fine on one model, then fail on another because it exceeded the context window. I’d get a generic error, manually switch models, and try again.
This got old fast. I needed something that could route requests intelligently based on what each model could actually handle, with automatic fallback when a model hit its limits. That’s when I started working with Litellm.
My Setup
I’m running:
- Ollama on a Proxmox VM with GPU passthrough
- Three models locally: llama3.2:1b, llama3.1:8b, and mistral:7b
- Litellm proxy running in a Docker container on the same network
- n8n workflows that call the proxy for various automation tasks
The goal was to let Litellm sit between my applications and Ollama, automatically picking the right model or falling back when needed—without changing how I call the API.
How I Configured the Router
Litellm uses a YAML config file to define model groups and routing rules. I created a simple config.yaml that looks like this:
```yaml
model_list:
  - model_name: local-small
    litellm_params:
      model: ollama/llama3.2:1b
      api_base: http://ollama-vm:11434
      max_tokens: 2048
  - model_name: local-medium
    litellm_params:
      model: ollama/llama3.1:8b
      api_base: http://ollama-vm:11434
      max_tokens: 8192
  - model_name: local-large
    litellm_params:
      model: ollama/mistral:7b
      api_base: http://ollama-vm:11434
      max_tokens: 32768

router_settings:
  routing_strategy: usage-based-routing
  num_retries: 2
  fallbacks:
    - local-small: [local-medium, local-large]
    - local-medium: [local-large]
```
This tells Litellm to try the smallest model first, then fall back to medium, then large if something fails. The max_tokens values are what I’ve confirmed each model can actually handle in my setup.
I started the proxy with:
```shell
docker run -d \
  --name litellm-proxy \
  -p 4000:4000 \
  -v $(pwd)/config.yaml:/app/config.yaml \
  ghcr.io/berriai/litellm:main-latest \
  --config /app/config.yaml
```
From my applications, I now point to http://litellm-proxy:4000 instead of directly to Ollama. The API is OpenAI-compatible, so nothing else needed to change.
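A quick way to sanity-check the proxy is to send a chat request with nothing but the Python standard library. This is a sketch based on my setup: the `litellm-proxy` host name, port 4000, and the `local-small` alias come from the config above, so swap in your own values.

```python
# Minimal sketch of an OpenAI-compatible chat request to the Litellm proxy.
# Host, port, and the "local-small" alias match my config; adjust for yours.
import json
import urllib.request

PROXY_URL = "http://litellm-proxy:4000/v1/chat/completions"

def build_request(prompt: str, model: str = "local-small") -> urllib.request.Request:
    """Build an OpenAI-style chat completion POST for the proxy."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        PROXY_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# With the proxy running, send it and read the reply:
#   with urllib.request.urlopen(build_request("ping")) as resp:
#       print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the endpoint is OpenAI-compatible, any OpenAI client library pointed at the proxy's base URL works the same way.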
What Worked
The routing logic works exactly as I needed. When I send a request with a short prompt, it hits llama3.2:1b and returns fast. When the prompt is longer or the context grows during a conversation, Litellm automatically routes to the next available model.
I tested this by sending progressively longer prompts through my n8n workflows. The first few went to the 1B model. When I crossed about 1500 tokens, the next request automatically went to the 8B model. No errors, no manual intervention.
The fallback behavior also saved me during model restarts. I was updating Ollama and had to stop one model temporarily. Requests that would have failed just moved to the next model in the chain. My workflows didn’t break.
Another benefit I didn’t expect: cost tracking. Even though these are local models, Litellm logs token usage and which model handled each request. This helped me see that most of my automation tasks could stay on the smallest model, which keeps GPU usage lower.
What Didn’t Work
The initial routing strategy I tried was least-busy, which picks the model with the fewest active requests. In theory, this should balance load. In practice, it sent requests to the large model even when the small one was idle, because Litellm couldn’t tell the difference between “busy” and “unavailable due to context limits.”
Switching to usage-based-routing fixed this. It respects the order I defined and only falls back when there’s an actual failure.
I also ran into an issue with streaming responses. When a model fallback happens mid-stream, the client connection drops. Litellm retries the entire request with the next model, but if you’re watching a response build in real-time, you see it restart. This isn’t a dealbreaker, but it’s noticeable.
One thing I couldn’t get working: automatic routing based on predicted context length before sending the request. Litellm doesn’t analyze the prompt ahead of time to guess which model to use. It only falls back after a failure. This means the first attempt might fail unnecessarily if I know the prompt is large. I’ve started adding a hint in my n8n workflows to specify a model directly when I know the context will be big.
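That model hint can live entirely in the application layer. Below is a rough Python sketch of the pre-routing idea, assuming the aliases and context limits from my config and a crude estimate of about 4 characters per token; both numbers are approximations of my setup, not anything Litellm provides.

```python
# Sketch: pick a model alias up front from a rough token estimate,
# instead of letting the first attempt fail and fall back.
# Aliases and limits mirror my config; the heuristic is approximate.

# (alias, approximate usable context in tokens), smallest first
MODEL_LIMITS = [
    ("local-small", 2048),
    ("local-medium", 8192),
    ("local-large", 32768),
]

def estimate_tokens(text: str) -> int:
    """Very rough token estimate: ~4 characters per token for English."""
    return max(1, len(text) // 4)

def pick_model(prompt: str, reserve_for_output: int = 512) -> str:
    """Return the smallest model alias whose context fits prompt + reply."""
    needed = estimate_tokens(prompt) + reserve_for_output
    for alias, limit in MODEL_LIMITS:
        if needed <= limit:
            return alias
    return MODEL_LIMITS[-1][0]  # largest model is the last resort
```

In my n8n workflows this amounts to setting the `model` field before the HTTP request; the proxy's reactive fallback still covers the cases where the estimate is wrong.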
Key Takeaways
Litellm solved the problem I had: automatic fallback between local models when context limits are hit. It’s not magic—it reacts to failures rather than predicting them—but that’s enough for my use case.
The configuration is simple and the proxy is lightweight. I’m running it on the same VM as Ollama with minimal overhead. The OpenAI-compatible API means I didn’t have to rewrite any of my existing automation.
If you’re running multiple local models and want basic routing without building it yourself, this approach works. Just be clear about what it does and doesn’t do. It won’t optimize every request, but it will keep things running when a model can’t handle what you throw at it.
The main limitation is the reactive nature of the fallback. If you need predictive routing or want to avoid any failed attempts, you’ll need to add logic in your application layer. For everything else, Litellm handles it cleanly.