
Implementing automatic model selection based on query complexity: using lightweight classifiers to route requests between quantized and full-precision models in Ollama

AI & Tools · 4 min read · Published Jan 27, 2026

Why I wired Ollama to pick its own model

I run Ollama on a fanless N100 box in the living-room closet. The CPU has AVX-VNNI but no dGPU, so every millisecond counts when the kids yell “Hey Dad, why is the assistant so slow?” Last month I noticed that 80 % of the questions in the family chat-bot are “What’s for dinner?” or “Remind me to call Grandma.” Those fit in a 1.1 B-parameter q4_0 model that warms up in 38 ms. The other 20 % are “Explain this Rust lifetime error” or “Summarise this 30-page PDF,” jobs that drown in a 1 B model but fly with a 7 B q5_K_M.

I already had a routing layer (tiny Go service on :11435) that chooses which container gets the request. The missing piece was a dirt-simple classifier that lives beside Ollama and decides before we load weights. No cloud, no “LLM judges LLM” loops—just a 200 kB scikit-learn model that cost me one caffeinated evening to train.

My exact pipeline

1. Harvesting real queries

I enabled Ollama’s debug log (OLLAMA_DEBUG=1) and piped stderr through a small Cronicle job every night. After two weeks I had 4 812 requests. I anonymised them with a local Rust CLI that replaces names, SSIDs, file paths and git hashes with placeholders. Result: 4 812 rows, two columns only:

text,label
"what is for dinner tonight",light
"explain rust lifetime error in main.rs",heavy
...
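The Rust CLI itself isn't shown here, but the anonymisation idea is easy to sketch. A rough Python equivalent (the patterns and placeholder names are mine, purely illustrative — the real tool also handles names and SSIDs):

```python
import re

# Illustrative stand-in for the anonymisation pass; the article's actual
# tool is a Rust CLI with a wider pattern set. Order matters: hashes
# before paths, paths before bare file names.
PATTERNS = [
    (re.compile(r"\b[0-9a-f]{7,40}\b"), "<GIT_HASH>"),   # git hashes
    (re.compile(r"(/[\w.\-]+)+"), "<PATH>"),             # absolute file paths
    (re.compile(r"\b\w+\.(rs|py|go|pdf)\b"), "<FILE>"),  # bare file names
]

def anonymise(text: str) -> str:
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

print(anonymise("explain rust lifetime error in /home/vipin/main.rs"))
```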

2. Hand-labelling without going insane

I sampled 500 rows, skimmed the text, and tagged them light if a 1 B q4_0 answered acceptably in under 1 s on the N100. Anything that needed more world knowledge or longer than 1 s became heavy. That took 45 min with a tiny Tkinter UI I copy-pasted from a 2019 notebook. I did NOT label the full 4 812; 500 was enough because the decision surface is crude.
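The Tkinter UI isn't reproduced here; a terminal-only stand-in for the same labelling loop might look like this (the helper and file layout are assumptions based on the CSV sample above):

```python
import csv
import random

# Hypothetical terminal labelling loop -- a stand-in for the Tkinter UI.
# Samples n rows from the harvested CSV and writes a labelled CSV back out.
def label_sample(in_path: str, out_path: str, n: int = 500, seed: int = 42) -> None:
    with open(in_path, newline="") as f:
        rows = [r["text"] for r in csv.DictReader(f)]
    random.seed(seed)
    sample = random.sample(rows, min(n, len(rows)))
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["text", "label"])
        for text in sample:
            answer = input(f"{text!r} [l]ight / [h]eavy: ").strip().lower()
            writer.writerow([text, "heavy" if answer.startswith("h") else "light"])
```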

3. Training a model that fits in L2 cache

I tried three options:

  • Logistic-regression on TF-IDF → 93 % accuracy, 180 kB.
  • FastText supervised → 91 %, 350 kB.
  • A 3-layer CNN in Keras → 95 %, 2.1 MB.

I kept the logistic regression. The others were heavier with no user-visible gain. Hyper-parameters: min-df 3, n-gram 1-2, C 4.0, no lemmatiser (I don’t have spaCy on the router). Training time on the N100: 0.7 s. I pickled the vectoriser and coef_ to /etc/ollama-router/model.pkl (0640, ollama:ollama).
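The full train.py is 45 lines; a condensed sketch with those hyper-parameters might look like this (the single-pipeline pickle is my simplification — the post pickles the vectoriser and coef_ separately):

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Sketch of the training step: TF-IDF (min-df 3, 1-2-grams) feeding a
# logistic regression with C=4.0, as described in the article.
def train(texts, labels, out_path="model.pkl"):
    clf = make_pipeline(
        TfidfVectorizer(min_df=3, ngram_range=(1, 2)),
        LogisticRegression(C=4.0, max_iter=1000),
    )
    clf.fit(texts, labels)
    with open(out_path, "wb") as f:
        pickle.dump(clf, f)  # vectoriser + coefficients in one object
    return clf
```

On a corpus this small the whole fit is sub-second, which is why retraining weekly on the router itself is cheap.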

4. Wiring it into the routing layer

The Go service already inspects every POST to /api/chat. I added a 12-line function:

func routeModel(prompt string) string {
    // modelPicker wraps the trained classifier: one prompt in, one label out.
    if modelPicker.Predict([]string{prompt})[0] == "light" {
        return "llama3.2:1b-q4_0" // small quantised model, fast cold start
    }
    return "llama3.1:7b-q5_K_M" // heavier model for the hard 20 %
}

The first call loads the 1 B weights; the heavy model stays cold until needed. Average RAM on idle dropped from 4.9 GB to 1.3 GB, and the box no longer swaps when Plex transcodes.
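For contrast, the same decision expressed in Python (my reconstruction — the real picker lives inside the Go service; the model tags are parameters here so nothing gets hard-coded):

```python
import pickle

# Load the pickled scikit-learn pipeline the router consults.
def load_picker(path="/etc/ollama-router/model.pkl"):
    with open(path, "rb") as f:
        return pickle.load(f)

# Mirror of the Go routeModel(): classify the prompt, map label to model tag.
def route_model(picker, prompt: str, light_tag: str, heavy_tag: str) -> str:
    # predict() takes a list of texts and returns one label per text
    return light_tag if picker.predict([prompt])[0] == "light" else heavy_tag
```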

What actually failed

  • Sentence-BERT (all-MiniLM-L6-v2) gave 97 % accuracy but added 80 ms just to embed one sentence—longer than running the 1 B model. I rolled back.
  • Quantizing the classifier with skl2onnx crashed on an old protobuf pin required by another container. I stayed with pickle; the file is tiny anyway.
  • My first labelling pass used latency instead of quality. That routed “print hello world” to the 7 B model because the JVM was warming up. I re-labelled with cold-start times.
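The embedding-latency trap above is cheap to re-check before committing to a classifier. A minimal timing harness (function name and sample count are mine) that reports percentiles instead of a mean, so one slow cold call can't hide:

```python
import time

# Time a routing callable over repeated calls; returns p50 and max in ms.
def time_calls(fn, arg, n=50):
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        fn(arg)
        samples.append((time.perf_counter() - t0) * 1000.0)  # milliseconds
    samples.sort()
    return {"p50": samples[len(samples) // 2], "max": samples[-1]}
```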

Numbers I measured, not copied

Metric                                     Before    After
Mean first-token latency (light queries)   1.34 s    0.38 s
Mean RAM at idle                           4.9 GB    1.3 GB
CPU steal during 24 h (Proxmox)            18 %      4 %

Power at the wall fell 4 W; the UPS now lasts 82 min instead of 71 min during outages.

What I still don’t know

I have only two weeks of post-change logs. If query patterns shift (summer holidays → more homework questions) the 1 B model will start failing. I set a cron job that retrains the classifier every Sunday at 02:00 with the last 7 days of logs. So far drift is zero, but I’ll watch it.

Key takeaways

  1. A 200 kB scikit-learn model is enough to steer big/small LLM pairs if your categories are coarse.
  2. Measure cold-start, not warm-cache, or you’ll fool yourself.
  3. Pickle is fine on a single-user box; don’t cargo-cult ONNX if you don’t need cross-language portability.
  4. Keep the training pipeline in Cronicle so you actually rerun it.

The patch is three files: train.py (45 lines), router.go (+12 lines), and model.pkl (180 kB). It’s been quiet in the closet for 14 days—no complaints from the kids, no swap storms. That’s good enough for me.


About the Author

Vipin PG

Vipin PG is a software professional with 15+ years of hands-on experience in system infrastructure, browser performance, and AI-powered development. Holding an MCA from Kerala University, he has worked across enterprises in Dubai and Kochi before running his independent tech consultancy. He has written 180+ tutorials on Docker, networking, and system troubleshooting - and he actually runs the setups he writes about.
