Implementing an OpenAI-compatible API gateway with LiteLLM to load-balance requests across Ollama, LM Studio, and vLLM backends

Why I Built This Gateway

I run multiple local LLM backends in my homelab—Ollama for quick inference, LM Studio for testing different models, and vLLM when I need production-grade speed. The problem was simple: each one has its own API format, different endpoints, and no built-in way to balance load or fail over if one goes down.

I needed a single entry point that could:

  • Route requests to any available backend
  • Use the OpenAI API format everywhere (most tools expect this)
  • Distribute load automatically
  • Track which backend handled what

LiteLLM solved this. It's a proxy that sits between my applications and my backends, presenting one unified OpenAI-compatible API regardless of what's actually running underneath.

My Setup

I deployed LiteLLM as a Docker container on my Proxmox host. Here's what I connected it to:

  • Ollama – Running on a separate VM, handling general-purpose requests
  • LM Studio – On my desktop machine, used for testing new models
  • vLLM – Running on a GPU-enabled container for high-throughput tasks

All three expose OpenAI-compatible endpoints, but they're scattered across different machines and ports. LiteLLM unifies them.

Configuration File

LiteLLM uses a YAML config to define backends. Here's what mine looks like:

model_list:
  # Two backends share the "general" name, so LiteLLM balances between them.
  - model_name: general
    litellm_params:
      model: openai/llama3.2
      api_base: http://192.168.1.50:11434/v1   # Ollama (VM)
      api_key: "not-needed"

  - model_name: general
    litellm_params:
      model: openai/mistral
      api_base: http://192.168.1.60:1234/v1    # LM Studio (desktop)
      api_key: "not-needed"

  # vLLM gets its own name for high-throughput work.
  - model_name: fast
    litellm_params:
      model: openai/llama3.2
      api_base: http://192.168.1.70:8000/v1    # vLLM (GPU container)
      api_key: "not-needed"

Key points from this config:

  • The openai/ prefix tells LiteLLM to treat these as OpenAI-compatible endpoints
  • Multiple backends can share the same model_name—this enables load balancing
  • The api_key field is required by the underlying library, even if your backends don't need one
  • The /v1 suffix on api_base is critical—without it, requests fail

Docker Deployment

I run LiteLLM in a Docker container with this command:

docker run -d \
  --name litellm \
  -p 4000:4000 \
  -v $(pwd)/config.yaml:/app/config.yaml \
  ghcr.io/berriai/litellm:main-latest \
  --config /app/config.yaml

The proxy listens on port 4000. Any application that expects an OpenAI-compatible API can now point to http://192.168.1.100:4000 instead of hitting my backends directly.
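
A quick way to check that the proxy is up and the config loaded is to list the models it exposes. This is a minimal sketch using the same OpenAI client I use later; it assumes LiteLLM serves the standard OpenAI model-listing endpoint:

import openai

# Point the client at the LiteLLM proxy, not at any individual backend.
client = openai.OpenAI(api_key="not-used", base_url="http://192.168.1.100:4000")

# This should print the model group names from config.yaml ("general" and
# "fast") rather than the backend addresses.
for model in client.models.list():
    print(model.id)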

What Worked

Automatic Load Balancing

When I send a request to the general model, LiteLLM picks between Ollama and LM Studio automatically. I didn't have to write any routing logic—it just works.

If one backend is slow or down, the next request goes elsewhere. This saved me multiple times when I was upgrading Ollama or restarting vLLM.
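
A simple way to watch this from the client side is to fire a handful of identical requests at the general group and look at what comes back. Sketch only: whether the response's model field identifies the backend depends on what each backend reports, so treat it as a rough indicator.

import openai

client = openai.OpenAI(api_key="not-used", base_url="http://192.168.1.100:4000")

# Send a few identical requests to the "general" group. The backend choice
# happens entirely inside LiteLLM; nothing in this script routes anything.
for i in range(4):
    response = client.chat.completions.create(
        model="general",
        messages=[{"role": "user", "content": f"ping {i}"}],
    )
    print(i, response.model)  # often hints at which underlying model answered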

Unified Logging

LiteLLM logs every request with timing, backend used, and token counts. This made debugging much easier than tailing logs from three different services.

I can see exactly which backend handled a request and how long it took. When vLLM started returning slow responses, I caught it immediately in the logs.

OpenAI Client Compatibility

I use the official OpenAI Python client in my scripts. Pointing it at LiteLLM required only changing the base_url:

import openai

client = openai.OpenAI(
    api_key="not-used",
    base_url="http://192.168.1.100:4000"
)

response = client.chat.completions.create(
    model="general",
    messages=[{"role": "user", "content": "test"}]
)

No other code changes. This worked with n8n, my custom scripts, and even tools like Continue.dev in VSCode.

What Didn't Work

Missing /v1 Suffix

The first time I configured LiteLLM, I set api_base to http://192.168.1.50:11434 without the /v1 suffix. Every request returned a 404.

LiteLLM uses the OpenAI client library internally, which appends /chat/completions to the base URL. Without /v1, every request went to http://192.168.1.50:11434/chat/completions, a path Ollama doesn't serve, hence the 404s. Adding the suffix fixed it immediately.
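
In config terms, the whole fix was one suffix on the Ollama entry:

# Broken: the OpenAI client appends /chat/completions straight onto this,
# and Ollama serves nothing at http://192.168.1.50:11434/chat/completions
api_base: http://192.168.1.50:11434

# Working: requests land on /v1/chat/completions, which Ollama does serve
api_base: http://192.168.1.50:11434/v1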

System Message Handling

Some models I tried (like Google's Gemma) don't support system messages. LiteLLM would pass them through anyway, causing errors.

The fix was adding supports_system_message: False to the model config. This tells LiteLLM to convert system messages to user messages before sending them to the backend.
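
Here's roughly how that looks as a model entry. The values are illustrative rather than copied from my main config, and depending on your LiteLLM version the flag may need to live somewhere else, so check the docs if it doesn't take effect:

  # Illustrative entry for a model that can't take system messages.
  - model_name: gemma
    litellm_params:
      model: openai/gemma
      api_base: http://192.168.1.80:1234/v1   # placeholder backend address
      api_key: "not-needed"
      supports_system_message: False  # system messages get converted to user messages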

API Key Confusion

LiteLLM requires an api_key field in every model config, even when your backend doesn't need one. I initially left it blank, which caused cryptic errors.

Setting it to any non-empty string (like "not-needed") solved it. The key isn't validated—it's just passed through to satisfy the OpenAI client library.

Load Balancing Isn't Smart

Out of the box, LiteLLM balances load round-robin style rather than by backend performance. If vLLM is 10x faster than Ollama, they still get equal traffic.

This isn't a problem for my use case, but it does mean a faster backend won't automatically get a larger share of the requests.
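
If you do want the router to account for backend speed, LiteLLM's router settings expose other strategies. I'm sketching the option here from the LiteLLM docs rather than from my own setup, so the exact strategy names may vary by version:

# Goes alongside model_list in config.yaml (untested in my setup)
router_settings:
  routing_strategy: "latency-based-routing"  # favor deployments with lower observed latency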

Key Takeaways

  • LiteLLM works as advertised—it unifies multiple backends behind one OpenAI-compatible API
  • Always include /v1 in your api_base URLs
  • The api_key field is mandatory even when backends don't need it
  • By default, load balancing is simple round-robin, not performance-aware
  • Logging is excellent and makes debugging much easier
  • Using model_name groups lets you abstract backend details from your applications

This setup has been running for three months without issues. I can swap backends, upgrade models, or add new endpoints without changing any application code. That's exactly what I needed.