Why I Built This Hybrid Setup
I run a local Llama 3.1 8B model on my Proxmox homelab for quick tasks—summarizing notes, extracting data from logs, answering simple questions about my documentation. It’s fast, private, and costs nothing per query. But when I needed deeper reasoning—multi-step analysis of network logs, debugging complex Docker Compose issues, or planning infrastructure changes—the 8B model would hallucinate or give surface-level answers.
I didn’t want to route everything to a cloud API. That would kill the speed advantage and rack up costs for trivial queries. But I also couldn’t ignore that DeepSeek-V3, which I access through their API, consistently gave me better reasoning on hard problems. So I built a routing layer that decides which model handles each request based on what the query actually needs.
What I’m Actually Running
My setup has three pieces:
- Local Llama 3.1 8B running in Ollama on a Proxmox VM with GPU passthrough (RTX 3060). This handles 70-80% of my queries.
- DeepSeek-V3 API for complex reasoning tasks. I use their official API endpoint, not a third-party wrapper.
- Routing logic written in Python, running as a small Flask service in a Docker container on the same Proxmox host.
The router accepts requests from my n8n workflows, CLI tools, and a simple web interface I built. It classifies the query, picks the model, sends the request, and returns the response. No manual switching.
How the Router Decides
I tried a few approaches before settling on this one. My first attempt used keyword matching—if the query contained words like “analyze,” “debug,” or “plan,” it went to DeepSeek. This failed immediately because plenty of simple questions use those words.
What worked was a two-stage check:
- Complexity heuristics: Query length, number of clauses, presence of code blocks, and whether it references previous context. If the query is under 100 characters and asks a single question, it goes local.
- Lightweight classification: For borderline cases, I send the query to the local model with a system prompt asking it to classify itself as “simple” or “complex.” This sounds circular, but the 8B model is decent at recognizing when it’s out of its depth. If it says “complex,” I re-route to DeepSeek.
I also hardcoded a few patterns. Anything asking for code generation longer than 50 lines, multi-step debugging, or “explain why this failed” with attached logs goes straight to DeepSeek. Queries like “what’s my IP,” “summarize this paragraph,” or “list Docker containers” stay local.
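Those hardcoded overrides can be sketched as a small pattern table checked before anything else. The specific regexes below are illustrative stand-ins, not my exact list:

```python
import re

# Hypothetical override patterns, checked before the heuristics run.
FORCE_DEEPSEEK = [
    re.compile(r"explain why .* failed", re.I),
    re.compile(r"multi.?step debug", re.I),
]
FORCE_LOCAL = [
    re.compile(r"^what'?s my ip", re.I),
    re.compile(r"^summarize this paragraph", re.I),
    re.compile(r"^list docker containers", re.I),
]

def forced_route(query):
    """Return 'deepseek', 'local', or None if no override applies."""
    for pat in FORCE_DEEPSEEK:
        if pat.search(query):
            return "deepseek"
    for pat in FORCE_LOCAL:
        if pat.search(query):
            return "local"
    return None
```

Keeping the overrides separate from the heuristics means a pattern match short-circuits everything else, including the self-classification call.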
The Routing Service
Here’s the core logic in Python. I’m not including the full Flask boilerplate, just the decision function:
```python
def route_query(query, context=None):
    # Quick filters: short, single-question queries stay local
    # (the "fewer than two sentences" threshold is a reconstruction
    # of the single-question heuristic described above)
    if len(query) < 100 and "?" in query and query.count(".") <= 1:
        return "local"
    # Long queries or large attached context go straight to DeepSeek
    if len(query) > 200 or (context and len(context) > 500):
        return "deepseek"
    # Self-classification for borderline cases
    classification_prompt = f"Is this query simple or complex? Query: {query}"
    local_response = call_ollama(classification_prompt)
    if "complex" in local_response.lower():
        return "deepseek"
    return "local"
```
The call_ollama function hits my local Ollama instance at http://192.168.1.50:11434. If the route is “deepseek,” I call their API with my key stored in an environment variable. I don’t retry on failure—if DeepSeek is down, I log the error and fall back to local with a warning.
What Worked
This setup cut my API costs by about 75% while keeping response quality high. Most of my queries—“what’s this error code,” “reformat this JSON,” “summarize these bullet points”—don’t need DeepSeek’s reasoning. The local model handles them in under a second.
For the queries that do get routed to DeepSeek, the extra latency (2-4 seconds vs. 0.5 seconds local) is worth it. I’ve had it correctly debug a Traefik routing issue by tracing through my config files, explain why a Docker volume mount wasn’t persisting data, and suggest a better n8n workflow structure. The 8B model would have guessed or given generic advice.
The self-classification step works better than I expected. In about 90% of borderline cases, the local model correctly identifies when it’s being asked something beyond its scope. The 10% where it’s wrong usually involves queries that look simple but require deep context—like “is this normal?” with a log snippet. I’m still tuning this.
What Didn’t Work
My first version tried to use embeddings to classify query complexity. I generated embeddings for a set of “simple” and “complex” example queries, then compared new queries using cosine similarity. This was slow, fragile, and didn’t generalize well. A query like “optimize this SQL” would match “simple” examples because of the short length, even though it clearly needed deeper reasoning.
I also tried caching DeepSeek responses to avoid repeat API calls. This backfired because my queries are rarely identical—they’re variations on themes. The cache hit rate was under 5%, and managing invalidation wasn’t worth the complexity.
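The low hit rate follows directly from exact-match keying: even after normalizing case and whitespace, any rewording produces a different hash. A minimal sketch of the kind of cache key I was using:

```python
import hashlib

def cache_key(query: str) -> str:
    # Normalize case and whitespace, then hash; only queries that are
    # identical after normalization will ever hit the cache.
    normalized = " ".join(query.lower().split())
    return hashlib.sha256(normalized.encode()).hexdigest()
```

Two queries that mean the same thing but differ by a single word get different keys, which is exactly the "variations on themes" problem.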
Another mistake: I initially routed based on the response from the local model. If it gave a confident answer, I’d return it. If it hedged or said “I’m not sure,” I’d re-query DeepSeek. This doubled latency on complex queries and still missed cases where the local model confidently hallucinated.
Current Limitations
The router doesn’t handle multi-turn conversations well. If I ask a follow-up question, it treats it as a new query and might route it differently than the original. I’m working on passing conversation history to the routing logic, but that adds complexity.
It also can’t predict when I’ll want the slower, better answer even for a simple query. Sometimes I ask “what’s this error” knowing the local model will give a generic answer, but I don’t want to manually override the router every time.
And the self-classification step adds 0.5 seconds to borderline queries. For most of my use cases, that’s fine. But if I were building this for real-time chat, it would be too slow.
Key Takeaways
- Routing by query complexity works, but you need both heuristics and a fallback mechanism. Pure keyword matching fails.
- Having the local model classify its own capability is surprisingly effective, even if it feels hacky.
- Most queries don’t need the best model. Optimizing for cost and speed on the common case matters more than perfect accuracy on edge cases.
- Caching is overrated unless your queries are highly repetitive. Mine aren’t.
- Latency matters. Even a 2-second delay feels slow when you’re used to local responses.
What I’d Change
If I were starting over, I’d build the router into my n8n workflows instead of as a separate service. That would let me route based on workflow context, not just the query itself. For example, a “summarize” node could always go local, while a “debug” node always goes to DeepSeek.
I’d also track which queries get routed where and manually review misclassifications once a week. Right now, I only notice routing mistakes when I get a bad answer, which means I’m missing cases where the local model got lucky or DeepSeek was overkill.
And I’d experiment with smaller DeepSeek models (if they release them) or other mid-tier APIs. V3 is powerful, but I don’t always need that much reasoning. A hypothetical “DeepSeek-V3-Lite” might fill the gap between my 8B local model and the full API.
Running Costs
My local setup costs me nothing per query—just the electricity for the GPU, which is negligible since the server runs 24/7 anyway. DeepSeek charges per token, and I’m averaging about $8-12/month with this hybrid approach. Before routing, I was paying $30-40/month by sending everything to GPT-4 or Claude.
The router itself uses maybe 100MB of RAM and barely touches the CPU. It’s not a resource concern.
Would I Recommend This?
If you’re already running a local model and occasionally need better reasoning, yes. The routing layer is simple enough to build in an afternoon, and the cost savings are real.
If you’re not already self-hosting, this probably isn’t worth it. Just use a cloud API for everything. The complexity of running Ollama, managing a router, and debugging when things break only makes sense if you’re committed to local-first infrastructure.
And if your queries are all complex or all simple, skip the router. It’s only useful when you have a mix.