Tech Expert & Vibe Coder

With 15+ years of experience, I specialize in self-hosting, AI automation, and Vibe Coding – building applications using AI-powered tools like Google Antigravity, Dyad, and Cline. From homelabs to enterprise solutions.

Implementing automatic model selection based on query complexity: using lightweight classifiers to route requests between quantized and full-precision models in Ollama

Why I wired Ollama to pick its own model

I run Ollama on a fanless N100 box in the living-room closet. The CPU has AVX-VNNI but no dGPU, so every millisecond counts when the kids yell “Hey Dad, why is the assistant so slow?” Last month I noticed that 80% of the questions in the family chat-bot are “What’s for dinner?” or “Remind me to call Grandma.” Those fit in a 1.1 B-parameter q4_0 model that warms up in 38 ms. The other 20% are “Explain this Rust lifetime error” or “Summarise this 30-page PDF,” jobs that drown in a 1 B model but fly with a 7 B q5_K_M.

I already had a routing layer (tiny Go service on :11435) that chooses which container gets the request. The missing piece was a dirt-simple classifier that lives beside Ollama and decides before we load weights. No cloud, no “LLM judges LLM” loops—just a 200 kB scikit-learn model that cost me one caffeinated evening to train.

My exact pipeline

1. Harvesting real queries

I enabled Ollama’s debug log (OLLAMA_DEBUG=1) and piped stderr through a small Cronicle job every night. After two weeks I had 4,812 requests. I anonymised them with a local Rust CLI that replaces names, SSIDs, file paths and git hashes with placeholders. Result: 4,812 rows, two columns only:

text,label
"what is for dinner tonight",light
"explain rust lifetime error in main.rs",heavy
...
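The Rust anonymiser itself isn’t shown here, but the idea is plain regex-to-placeholder substitution before anything leaves the log. A minimal Python sketch of the same step — the patterns and placeholder names are my own illustration, not the actual CLI’s rules:

```python
import csv
import re

# Illustrative rules only: replace obvious identifiers with fixed placeholders.
RULES = [
    (re.compile(r"\b[0-9a-f]{7,40}\b"), "<GIT_HASH>"),       # git hashes
    (re.compile(r"(/[\w.\-]+){2,}"), "<PATH>"),              # file paths
    (re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b"), "<NAME>"),  # naive full names
]

def anonymise(text: str) -> str:
    """Apply every placeholder rule in order."""
    for pattern, placeholder in RULES:
        text = pattern.sub(placeholder, text)
    return text

def write_dataset(rows, path="queries.csv"):
    """rows: iterable of (raw_text, label) pairs scraped from the debug log."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["text", "label"])
        for text, label in rows:
            writer.writerow([anonymise(text), label])
```

Real-world patterns (SSIDs in particular) need more care than this, but the two-column CSV shape is all the classifier ever sees.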

2. Hand-labelling without going insane

I sampled 500 rows, skimmed the text, and tagged them light if a 1 B q4_0 answered acceptably in under 1 s on the N100. Anything that needed more world knowledge or took longer than 1 s became heavy. That took 45 min with a tiny Tkinter UI I copy-pasted from a 2019 notebook. I did NOT label the full 4,812; 500 was enough because the decision surface is crude.
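A labelling loop like that fits in a few lines. This terminal version is a hypothetical stand-in for the Tkinter UI — press l for light, anything else for heavy:

```python
import csv
import random

def label_sample(in_path="queries.csv", out_path="labelled.csv", n=500, seed=7):
    """Sample n queries and record a one-keystroke label per query.
    A terminal stand-in for a GUI labeller: 'l' = light, anything else = heavy."""
    with open(in_path, newline="") as f:
        rows = [r["text"] for r in csv.DictReader(f)]
    random.seed(seed)
    sample = random.sample(rows, min(n, len(rows)))
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["text", "label"])
        for text in sample:
            answer = input(f"{text!r} [l/h]: ").strip().lower()
            writer.writerow([text, "light" if answer == "l" else "heavy"])
```

At roughly five seconds per keystroke, 500 rows really is a ~45-minute job.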

3. Training a model that fits in L2 cache

I tried three options:

  • Logistic regression on TF-IDF → 93% accuracy, 180 kB.
  • FastText supervised → 91%, 350 kB.
  • A 3-layer CNN in Keras → 95%, 2.1 MB.

I kept the logistic regression. The others were heavier with no user-visible gain. Hyper-parameters: min_df 3, n-grams 1–2, C 4.0, no lemmatiser (I don’t have spaCy on the router). Training time on the N100: 0.7 s. I pickled the vectoriser and the fitted classifier to /etc/ollama-router/model.pkl (0640, ollama:ollama).
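With those hyper-parameters the whole training step is a handful of scikit-learn calls. A sketch, assuming a single Pipeline pickled as one object rather than vectoriser and classifier separately (max_iter is my own addition to keep lbfgs from complaining about convergence):

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def train(texts, labels, out_path="model.pkl"):
    """Fit the router classifier with the hyper-parameters from the post:
    min_df=3, word 1-2-grams, C=4.0, no lemmatisation."""
    clf = Pipeline([
        ("tfidf", TfidfVectorizer(min_df=3, ngram_range=(1, 2))),
        ("logreg", LogisticRegression(C=4.0, max_iter=1000)),
    ])
    clf.fit(texts, labels)
    with open(out_path, "wb") as f:
        pickle.dump(clf, f)  # one file: vocabulary + weights together
    return clf
```

Pickling the Pipeline as one object keeps the vocabulary and the coefficients in lockstep, so a retrain can never ship mismatched halves.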

4. Wiring it into the routing layer

The Go service already inspects every POST to /api/chat. I added a 12-line function:

func routeModel(prompt string) string {
    // modelPicker wraps the pickled scikit-learn classifier.
    if modelPicker.Predict([]string{prompt})[0] == "light" {
        return "llama3.2:1b-q4_0" // warm in memory, ~0.4 s to first token
    }
    return "llama3.1:7b-q5_K_M" // loaded lazily on the first heavy query
}

The first call loads the 1 B weights; the heavy model stays cold until needed. Average RAM on idle dropped from 4.9 GB to 1.3 GB, and the box no longer swaps when Plex transcodes.
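How the Go service reaches the pickled model isn’t shown above — a sidecar, a subprocess, or an embedded interpreter would all work. A Python sketch of the decision itself, mirroring routeModel; load_picker and route_model are illustrative names, not the router’s real API:

```python
import pickle

# Model tags match the post; the path is the one from the training step.
LIGHT_MODEL = "llama3.2:1b-q4_0"
HEAVY_MODEL = "llama3.1:7b-q5_K_M"

def load_picker(path="/etc/ollama-router/model.pkl"):
    """Load the pickled TF-IDF + logistic-regression pipeline once at startup."""
    with open(path, "rb") as f:
        return pickle.load(f)

def route_model(clf, prompt: str) -> str:
    """One predict per request; the label alone decides the Ollama model tag."""
    label = clf.predict([prompt])[0]
    return LIGHT_MODEL if label == "light" else HEAVY_MODEL
```

The classifier object is loaded once and reused; the per-request cost is a sparse dot product, well under a millisecond on the N100.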

What actually failed

  • Sentence-BERT (all-MiniLM-L6-v2) gave 97 % accuracy but added 80 ms just to embed one sentence—longer than running the 1 B model. I rolled back.
  • Quantizing the classifier with skl2onnx crashed on an old protobuf pin required by another container. I stayed with pickle; the file is tiny anyway.
  • My first labelling pass used latency instead of quality. That routed “print hello world” to the 7 B model because the JVM was warming up. I re-labelled with cold-start times.

Numbers I measured, not copied

Metric                                     Before    After
mean first-token latency (light queries)   1.34 s    0.38 s
mean RAM at idle                           4.9 GB    1.3 GB
CPU steal during 24 h (Proxmox)            18%       4%

Power at the wall fell 4 W; the UPS now lasts 82 min instead of 71 min during outages.

What I still don’t know

I have only two weeks of post-change logs. If query patterns shift (summer holidays → more homework questions) the 1 B model will start failing. I set a cron job that retrains the classifier every Sunday at 02:00 with the last 7 days of logs. So far drift is zero, but I’ll watch it.
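The Sunday retrain can be a short script that filters the last seven days of labelled logs and refits with the same hyper-parameters. A sketch, assuming each log row carries an ISO date column — the real log schema isn’t shown here:

```python
import csv
import datetime as dt
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def retrain(log_path, out_path, days=7):
    """Refit the router classifier on the last `days` of labelled logs.
    Assumes rows shaped as date,text,label with ISO dates (my assumption)."""
    cutoff = dt.date.today() - dt.timedelta(days=days)
    texts, labels = [], []
    with open(log_path, newline="") as f:
        for row in csv.DictReader(f):
            if dt.date.fromisoformat(row["date"]) >= cutoff:
                texts.append(row["text"])
                labels.append(row["label"])
    clf = Pipeline([
        ("tfidf", TfidfVectorizer(min_df=3, ngram_range=(1, 2))),
        ("logreg", LogisticRegression(C=4.0, max_iter=1000)),
    ])
    clf.fit(texts, labels)
    with open(out_path, "wb") as f:
        pickle.dump(clf, f)
```

Writing to a temp file and renaming it over model.pkl would keep the router from ever reading a half-written pickle mid-retrain.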

Key takeaways

  1. A 200 kB scikit-learn model is enough to steer big/small LLM pairs if your categories are coarse.
  2. Measure cold-start, not warm-cache, or you’ll fool yourself.
  3. Pickle is fine on a single-user box; don’t cargo-cult ONNX if you don’t need cross-language.
  4. Keep the training pipeline in Cronicle so you actually rerun it.

The patch is three files: train.py (45 lines), router.go (+12 lines), and model.pkl (180 kB). It’s been quiet in the closet for 14 days—no complaints from the kids, no swap storms. That’s good enough for me.
