
Building a local LLM routing layer with Litellm to automatically fallback between Ollama models based on context length limits

Why I Built This

I run multiple Ollama models locally on my Proxmox server—some small and fast, others larger and more capable. The problem I kept hitting was simple: a prompt would work fine on one model, then fail on another because it exceeded the context window. I’d get a generic error, manually switch models, and try again.

This got old fast. I needed something that could route requests intelligently based on what each model could actually handle, with automatic fallback when a model hit its limits. That’s when I started working with Litellm.

My Setup

I’m running:

  • Ollama on a Proxmox VM with GPU passthrough
  • Three models locally: llama3.2:1b, llama3.1:8b, and mistral:7b
  • Litellm proxy running in a Docker container on the same network
  • n8n workflows that call the proxy for various automation tasks

The goal was to let Litellm sit between my applications and Ollama, automatically picking the right model or falling back when needed—without changing how I call the API.

How I Configured the Router

Litellm uses a YAML config file to define model groups and routing rules. I created a simple config.yaml that looks like this:

model_list:
  - model_name: local-small
    litellm_params:
      model: ollama/llama3.2:1b
      api_base: http://ollama-vm:11434
      max_tokens: 2048

  - model_name: local-medium
    litellm_params:
      model: ollama/llama3.1:8b
      api_base: http://ollama-vm:11434
      max_tokens: 8192

  - model_name: local-large
    litellm_params:
      model: ollama/mistral:7b
      api_base: http://ollama-vm:11434
      max_tokens: 32768

router_settings:
  routing_strategy: usage-based-routing
  num_retries: 2
  fallbacks:
    - local-small: ["local-medium", "local-large"]
    - local-medium: ["local-large"]

This tells Litellm to try the smallest model first, then fall back to medium, then large if a request fails. The fallbacks entries map each model to the chain it should fall back to. The max_tokens values are what I’ve confirmed each model can actually handle in my setup.

I started the proxy with:

docker run -d \
  --name litellm-proxy \
  -p 4000:4000 \
  -v $(pwd)/config.yaml:/app/config.yaml \
  ghcr.io/berriai/litellm:main-latest \
  --config /app/config.yaml

From my applications, I now point to http://litellm-proxy:4000 instead of directly to Ollama. The API is OpenAI-compatible, so nothing else needed to change.
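With the proxy in place, a call from a script is just an OpenAI-style POST. Here is a minimal stdlib-only sketch of what that looks like; the host name and model group come from my setup, and build_request/ask are my own helper names, not anything Litellm provides:

```python
# Minimal sketch of calling the LiteLLM proxy with only the standard library.
# PROXY_URL and the "local-small" model group come from my setup above;
# swap in your own host and model_name values.
import json
import urllib.request

PROXY_URL = "http://litellm-proxy:4000/v1/chat/completions"

def build_request(prompt: str, model: str = "local-small") -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for the LiteLLM proxy."""
    payload = {
        "model": model,  # a model group from config.yaml, not a raw Ollama name
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        PROXY_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(prompt: str) -> str:
    """Send the request and return the assistant's reply."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Because the API is OpenAI-compatible, any OpenAI client library pointed at the proxy works the same way.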

What Worked

The routing logic works exactly as I needed. When I send a request with a short prompt, it hits llama3.2:1b and returns fast. When the prompt is longer or the context grows during a conversation, Litellm automatically routes to the next available model.

I tested this by sending progressively longer prompts through my n8n workflows. The first few went to the 1B model. When I crossed about 1500 tokens, the next request automatically went to the 8B model. No errors, no manual intervention.
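The test loop itself was simple; roughly this, where ask stands for any helper that POSTs to the proxy and returns the parsed response (the "model" field on an OpenAI-style response reports which backend actually served it), and the chars-per-token estimate is my own rough heuristic, not anything Litellm exposes:

```python
# Rough sketch of the probe: grow the prompt in steps and record which model
# answers each time. estimate_tokens is a crude ~4-chars-per-token heuristic;
# `ask` is any callable that sends a prompt through the proxy and returns the
# parsed response dict.
def estimate_tokens(text: str) -> int:
    """Crude ~4-chars-per-token estimate; good enough for eyeballing limits."""
    return len(text) // 4

def run_probe(ask, step_tokens: int = 500, limit: int = 4000) -> list[tuple[int, str]]:
    """Send progressively longer prompts; record (estimated tokens, serving model)."""
    filler = "lorem ipsum "
    results = []
    prompt = ""
    while estimate_tokens(prompt) < limit:
        prompt += filler * (step_tokens * 4 // len(filler))  # grow by ~step_tokens
        response = ask("Summarize this: " + prompt)
        results.append((estimate_tokens(prompt), response["model"]))
    return results
```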

The fallback behavior also saved me during model restarts. I was updating Ollama and had to stop one model temporarily. Requests that would have failed just moved to the next model in the chain. My workflows didn’t break.

Another benefit I didn’t expect: cost tracking. Even though these are local models, Litellm logs token usage and which model handled each request. This helped me see that most of my automation tasks could stay on the smallest model, which keeps GPU usage lower.
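Litellm’s own logs give that per-model breakdown, but the same idea is easy to replicate client-side by tallying the "model" and "usage" fields that come back on every OpenAI-style response. A small sketch of that aggregation:

```python
# Client-side sketch of the usage breakdown LiteLLM logs for you: tally
# requests and total tokens per serving model from a list of OpenAI-style
# response dicts (each has "model" and "usage" fields).
from collections import defaultdict

def tally(responses: list[dict]) -> dict[str, dict[str, int]]:
    """Aggregate request count and total tokens per serving model."""
    stats: dict[str, dict[str, int]] = defaultdict(
        lambda: {"requests": 0, "total_tokens": 0}
    )
    for r in responses:
        entry = stats[r["model"]]
        entry["requests"] += 1
        entry["total_tokens"] += r["usage"]["total_tokens"]
    return dict(stats)
```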

What Didn’t Work

The initial routing strategy I tried was least-busy, which picks the model with the fewest active requests. In theory, this should balance load. In practice, it sent requests to the large model even when the small one was idle, because Litellm couldn’t tell the difference between “busy” and “unavailable due to context limits.”

Switching to usage-based-routing fixed this. It respects the order I defined and only falls back when there’s an actual failure.

I also ran into an issue with streaming responses. When a model fallback happens mid-stream, the client connection drops. Litellm retries the entire request with the next model, but if you’re watching a response build in real-time, you see it restart. This isn’t a dealbreaker, but it’s noticeable.

One thing I couldn’t get working: automatic routing based on predicted context length before sending the request. Litellm doesn’t analyze the prompt ahead of time to guess which model to use. It only falls back after a failure. This means the first attempt might fail unnecessarily if I know the prompt is large. I’ve started adding a hint in my n8n workflows to specify a model directly when I know the context will be big.
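The hint logic in my workflows boils down to something like the sketch below: estimate the prompt size up front and name a model group directly when it clearly won’t fit the small model. The thresholds mirror the max_tokens values from my config, and the 4-chars-per-token estimate is a rough heuristic rather than a real tokenizer:

```python
# Sketch of predictive model selection done in the application layer, since
# LiteLLM only falls back after a failure. MODEL_LIMITS mirrors the
# max_tokens values in my config.yaml; len(prompt)//4 is a crude token
# estimate, not a real tokenizer.
MODEL_LIMITS = [
    ("local-small", 2048),
    ("local-medium", 8192),
    ("local-large", 32768),
]

def pick_model(prompt: str, reserve: int = 512) -> str:
    """Pick the smallest model group whose context fits prompt + reply headroom."""
    needed = len(prompt) // 4 + reserve
    for name, limit in MODEL_LIMITS:
        if needed <= limit:
            return name
    return MODEL_LIMITS[-1][0]  # oversized: let the largest model try anyway
```

The picked name goes into the "model" field of the request, and Litellm’s fallback chain still catches anything the estimate gets wrong.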

Key Takeaways

Litellm solved the problem I had: automatic fallback between local models when context limits are hit. It’s not magic—it reacts to failures rather than predicting them—but that’s enough for my use case.

The configuration is simple and the proxy is lightweight. I’m running it on the same VM as Ollama with minimal overhead. The OpenAI-compatible API means I didn’t have to rewrite any of my existing automation.

If you’re running multiple local models and want basic routing without building it yourself, this approach works. Just be clear about what it does and doesn’t do. It won’t optimize every request, but it will keep things running when a model can’t handle what you throw at it.

The main limitation is the reactive nature of the fallback. If you need predictive routing or want to avoid any failed attempts, you’ll need to add logic in your application layer. For everything else, Litellm handles it cleanly.
