
Building a local LLM routing layer with Litellm to automatically fallback between Ollama models based on context length limits

Why I Built This

I run multiple Ollama models locally on my Proxmox server—some small and fast, others larger and more capable. The problem I kept hitting was simple: a prompt would work fine on one model, then fail on another because it exceeded the context window. I’d get a generic error, manually switch models, and try again.

This got old fast. I needed something that could route requests intelligently based on what each model could actually handle, with automatic fallback when a model hit its limits. That’s when I started working with Litellm.

My Setup

I’m running:

  • Ollama on a Proxmox VM with GPU passthrough
  • Three models locally: llama3.2:1b, llama3.1:8b, and mistral:7b
  • Litellm proxy running in a Docker container on the same network
  • n8n workflows that call the proxy for various automation tasks

The goal was to let Litellm sit between my applications and Ollama, automatically picking the right model or falling back when needed—without changing how I call the API.

How I Configured the Router

Litellm uses a YAML config file to define model groups and routing rules. I created a simple config.yaml that looks like this:

model_list:
  - model_name: local-small
    litellm_params:
      model: ollama/llama3.2:1b
      api_base: http://ollama-vm:11434
      max_tokens: 2048

  - model_name: local-medium
    litellm_params:
      model: ollama/llama3.1:8b
      api_base: http://ollama-vm:11434
      max_tokens: 8192

  - model_name: local-large
    litellm_params:
      model: ollama/mistral:7b
      api_base: http://ollama-vm:11434
      max_tokens: 32768

router_settings:
  routing_strategy: usage-based-routing
  num_retries: 2
  fallbacks:
    - local-small: ["local-medium", "local-large"]
    - local-medium: ["local-large"]

This tells Litellm to try the smallest model first, then fall back to medium, then large if a request fails. The fallbacks entries map each model to the chain it should fall back to. The max_tokens values are what I’ve confirmed each model can actually handle in my setup.

I started the proxy with:

docker run -d \
  --name litellm-proxy \
  -p 4000:4000 \
  -v $(pwd)/config.yaml:/app/config.yaml \
  ghcr.io/berriai/litellm:main-latest \
  --config /app/config.yaml

From my applications, I now point to http://litellm-proxy:4000 instead of directly to Ollama. The API is OpenAI-compatible, so nothing else needed to change.
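With the proxy in place, a call from a script is just an OpenAI-style POST. Here is a minimal stdlib-only sketch of what that looks like; the host name and model group come from my setup, and build_request/ask are my own helper names, not anything Litellm provides:

```python
# Minimal sketch of calling the LiteLLM proxy with only the standard library.
# PROXY_URL and the "local-small" model group come from my setup above;
# swap in your own host and model_name values.
import json
import urllib.request

PROXY_URL = "http://litellm-proxy:4000/v1/chat/completions"

def build_request(prompt: str, model: str = "local-small") -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for the LiteLLM proxy."""
    payload = {
        "model": model,  # a model group from config.yaml, not a raw Ollama name
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        PROXY_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(prompt: str) -> str:
    """Send the request and return the assistant's reply."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Because the API is OpenAI-compatible, any OpenAI client library pointed at the proxy works the same way.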

What Worked

The routing logic works exactly as I needed. When I send a request with a short prompt, it hits llama3.2:1b and returns fast. When the prompt is longer or the context grows during a conversation, Litellm automatically routes to the next available model.

I tested this by sending progressively longer prompts through my n8n workflows. The first few went to the 1B model. When I crossed about 1500 tokens, the next request automatically went to the 8B model. No errors, no manual intervention.
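The test loop itself was simple; roughly this, where ask stands for any helper that POSTs to the proxy and returns the parsed response (the "model" field on an OpenAI-style response reports which backend actually served it), and the chars-per-token estimate is my own rough heuristic, not anything Litellm exposes:

```python
# Rough sketch of the probe: grow the prompt in steps and record which model
# answers each time. estimate_tokens is a crude ~4-chars-per-token heuristic;
# `ask` is any callable that sends a prompt through the proxy and returns the
# parsed response dict.
def estimate_tokens(text: str) -> int:
    """Crude ~4-chars-per-token estimate; good enough for eyeballing limits."""
    return len(text) // 4

def run_probe(ask, step_tokens: int = 500, limit: int = 4000) -> list[tuple[int, str]]:
    """Send progressively longer prompts; record (estimated tokens, serving model)."""
    filler = "lorem ipsum "
    results = []
    prompt = ""
    while estimate_tokens(prompt) < limit:
        prompt += filler * (step_tokens * 4 // len(filler))  # grow by ~step_tokens
        response = ask("Summarize this: " + prompt)
        results.append((estimate_tokens(prompt), response["model"]))
    return results
```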

The fallback behavior also saved me during model restarts. I was updating Ollama and had to stop one model temporarily. Requests that would have failed just moved to the next model in the chain. My workflows didn’t break.

Another benefit I didn’t expect: cost tracking. Even though these are local models, Litellm logs token usage and which model handled each request. This helped me see that most of my automation tasks could stay on the smallest model, which keeps GPU usage lower.
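Litellm’s own logs give that per-model breakdown, but the same idea is easy to replicate client-side by tallying the "model" and "usage" fields that come back on every OpenAI-style response. A small sketch of that aggregation:

```python
# Client-side sketch of the usage breakdown LiteLLM logs for you: tally
# requests and total tokens per serving model from a list of OpenAI-style
# response dicts (each has "model" and "usage" fields).
from collections import defaultdict

def tally(responses: list[dict]) -> dict[str, dict[str, int]]:
    """Aggregate request count and total tokens per serving model."""
    stats: dict[str, dict[str, int]] = defaultdict(
        lambda: {"requests": 0, "total_tokens": 0}
    )
    for r in responses:
        entry = stats[r["model"]]
        entry["requests"] += 1
        entry["total_tokens"] += r["usage"]["total_tokens"]
    return dict(stats)
```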

What Didn’t Work

The initial routing strategy I tried was least-busy, which picks the model with the fewest active requests. In theory, this should balance load. In practice, it sent requests to the large model even when the small one was idle, because Litellm couldn’t tell the difference between “busy” and “unavailable due to context limits.”

Switching to usage-based-routing fixed this. It respects the order I defined and only falls back when there’s an actual failure.

I also ran into an issue with streaming responses. When a model fallback happens mid-stream, the client connection drops. Litellm retries the entire request with the next model, but if you’re watching a response build in real-time, you see it restart. This isn’t a dealbreaker, but it’s noticeable.

One thing I couldn’t get working: automatic routing based on predicted context length before sending the request. Litellm doesn’t analyze the prompt ahead of time to guess which model to use. It only falls back after a failure. This means the first attempt might fail unnecessarily if I know the prompt is large. I’ve started adding a hint in my n8n workflows to specify a model directly when I know the context will be big.
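The hint logic in my workflows boils down to something like the sketch below: estimate the prompt size up front and name a model group directly when it clearly won’t fit the small model. The thresholds mirror the max_tokens values from my config, and the 4-chars-per-token estimate is a rough heuristic rather than a real tokenizer:

```python
# Sketch of predictive model selection done in the application layer, since
# LiteLLM only falls back after a failure. MODEL_LIMITS mirrors the
# max_tokens values in my config.yaml; len(prompt)//4 is a crude token
# estimate, not a real tokenizer.
MODEL_LIMITS = [
    ("local-small", 2048),
    ("local-medium", 8192),
    ("local-large", 32768),
]

def pick_model(prompt: str, reserve: int = 512) -> str:
    """Pick the smallest model group whose context fits prompt + reply headroom."""
    needed = len(prompt) // 4 + reserve
    for name, limit in MODEL_LIMITS:
        if needed <= limit:
            return name
    return MODEL_LIMITS[-1][0]  # oversized: let the largest model try anyway
```

The picked name goes into the "model" field of the request, and Litellm’s fallback chain still catches anything the estimate gets wrong.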

Key Takeaways

Litellm solved the problem I had: automatic fallback between local models when context limits are hit. It’s not magic—it reacts to failures rather than predicting them—but that’s enough for my use case.

The configuration is simple and the proxy is lightweight. I’m running it on the same VM as Ollama with minimal overhead. The OpenAI-compatible API means I didn’t have to rewrite any of my existing automation.

If you’re running multiple local models and want basic routing without building it yourself, this approach works. Just be clear about what it does and doesn’t do. It won’t optimize every request, but it will keep things running when a model can’t handle what you throw at it.

The main limitation is the reactive nature of the fallback. If you need predictive routing or want to avoid any failed attempts, you’ll need to add logic in your application layer. For everything else, Litellm handles it cleanly.
