Tech Expert & Vibe Coder

With 15+ years of experience, I specialize in self-hosting, AI automation, and Vibe Coding – building applications using AI-powered tools like Google Antigravity, Dyad, and Cline. From homelabs to enterprise solutions.

Implementing automatic model selection based on query complexity: using lightweight classifiers to route requests between quantized and full-precision models in Ollama

Why I wired Ollama to pick its own model

I run Ollama on a fanless N100 box in the living-room closet. The CPU has AVX-VNNI but no dGPU, so every millisecond counts when the kids yell “Hey Dad, why is the assistant so slow?” Last month I noticed that 80% of the questions in the family chat-bot are “What’s for dinner?” or “Remind me to call Grandma.” Those fit in a 1.1 B-parameter q4_0 model that warms up in 38 ms. The other 20% are “Explain this Rust lifetime error” or “Summarise this 30-page PDF,” jobs that drown in a 1 B model but fly with a 7 B q5_K_M.

I already had a routing layer (tiny Go service on :11435) that chooses which container gets the request. The missing piece was a dirt-simple classifier that lives beside Ollama and decides before we load weights. No cloud, no “LLM judges LLM” loops—just a 200 kB scikit-learn model that cost me one caffeinated evening to train.

My exact pipeline

1. Harvesting real queries

I enabled Ollama’s debug log (OLLAMA_DEBUG=1) and piped stderr through a small Cronicle job every night. After two weeks I had 4,812 requests. I anonymised them with a local Rust CLI that replaces names, SSIDs, file paths and git hashes with placeholders. Result: 4,812 rows, two columns only:

text,label
"what is for dinner tonight",light
"explain rust lifetime error in main.rs",heavy
...
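The Rust anonymiser itself isn’t shown here, but the idea is plain regex-to-placeholder substitution before anything leaves the log. A minimal Python sketch of the same step — the patterns and placeholder names are my own illustration, not the actual CLI’s rules:

```python
import csv
import re

# Illustrative rules only: replace obvious identifiers with fixed placeholders.
RULES = [
    (re.compile(r"\b[0-9a-f]{7,40}\b"), "<GIT_HASH>"),       # git hashes
    (re.compile(r"(/[\w.\-]+){2,}"), "<PATH>"),              # file paths
    (re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b"), "<NAME>"),  # naive full names
]

def anonymise(text: str) -> str:
    """Apply every placeholder rule in order."""
    for pattern, placeholder in RULES:
        text = pattern.sub(placeholder, text)
    return text

def write_dataset(rows, path="queries.csv"):
    """rows: iterable of (raw_text, label) pairs scraped from the debug log."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["text", "label"])
        for text, label in rows:
            writer.writerow([anonymise(text), label])
```

Real-world patterns (SSIDs in particular) need more care than this, but the two-column CSV shape is all the classifier ever sees.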

2. Hand-labelling without going insane

I sampled 500 rows, skimmed the text, and tagged them light if a 1 B q4_0 answered acceptably in under 1 s on the N100. Anything that needed more world knowledge or took longer than 1 s became heavy. That took 45 min with a tiny Tkinter UI I copy-pasted from a 2019 notebook. I did NOT label the full 4,812; 500 was enough because the decision surface is crude.
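A labelling loop like that fits in a few lines. This terminal version is a hypothetical stand-in for the Tkinter UI — press l for light, anything else for heavy:

```python
import csv
import random

def label_sample(in_path="queries.csv", out_path="labelled.csv", n=500, seed=7):
    """Sample n queries and record a one-keystroke label per query.
    A terminal stand-in for a GUI labeller: 'l' = light, anything else = heavy."""
    with open(in_path, newline="") as f:
        rows = [r["text"] for r in csv.DictReader(f)]
    random.seed(seed)
    sample = random.sample(rows, min(n, len(rows)))
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["text", "label"])
        for text in sample:
            answer = input(f"{text!r} [l/h]: ").strip().lower()
            writer.writerow([text, "light" if answer == "l" else "heavy"])
```

At roughly five seconds per keystroke, 500 rows really is a ~45-minute job.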

3. Training a model that fits in L2 cache

I tried three options:

  • Logistic regression on TF-IDF → 93% accuracy, 180 kB.
  • FastText supervised → 91%, 350 kB.
  • A 3-layer CNN in Keras → 95%, 2.1 MB.

I kept the logistic regression. The others were heavier with no user-visible gain. Hyper-parameters: min_df 3, n-grams 1–2, C 4.0, no lemmatiser (I don’t have spaCy on the router). Training time on the N100: 0.7 s. I pickled the vectoriser and the fitted classifier to /etc/ollama-router/model.pkl (0640, ollama:ollama).
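With those hyper-parameters the whole training step is a handful of scikit-learn calls. A sketch, assuming a single Pipeline pickled as one object rather than vectoriser and classifier separately (max_iter is my own addition to keep lbfgs from complaining about convergence):

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def train(texts, labels, out_path="model.pkl"):
    """Fit the router classifier with the hyper-parameters from the post:
    min_df=3, word 1-2-grams, C=4.0, no lemmatisation."""
    clf = Pipeline([
        ("tfidf", TfidfVectorizer(min_df=3, ngram_range=(1, 2))),
        ("logreg", LogisticRegression(C=4.0, max_iter=1000)),
    ])
    clf.fit(texts, labels)
    with open(out_path, "wb") as f:
        pickle.dump(clf, f)  # one file: vocabulary + weights together
    return clf
```

Pickling the Pipeline as one object keeps the vocabulary and the coefficients in lockstep, so a retrain can never ship mismatched halves.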

4. Wiring it into the routing layer

The Go service already inspects every POST to /api/chat. I added a 12-line function:

func routeModel(prompt string) string {
    // modelPicker wraps the pickled scikit-learn classifier.
    if modelPicker.Predict([]string{prompt})[0] == "light" {
        return "llama3.2:1b-q4_0" // warm in memory, ~0.4 s to first token
    }
    return "llama3.1:7b-q5_K_M" // loaded lazily on the first heavy query
}

The first call loads the 1 B weights; the heavy model stays cold until needed. Average RAM on idle dropped from 4.9 GB to 1.3 GB, and the box no longer swaps when Plex transcodes.
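How the Go service reaches the pickled model isn’t shown above — a sidecar, a subprocess, or an embedded interpreter would all work. A Python sketch of the decision itself, mirroring routeModel; load_picker and route_model are illustrative names, not the router’s real API:

```python
import pickle

# Model tags match the post; the path is the one from the training step.
LIGHT_MODEL = "llama3.2:1b-q4_0"
HEAVY_MODEL = "llama3.1:7b-q5_K_M"

def load_picker(path="/etc/ollama-router/model.pkl"):
    """Load the pickled TF-IDF + logistic-regression pipeline once at startup."""
    with open(path, "rb") as f:
        return pickle.load(f)

def route_model(clf, prompt: str) -> str:
    """One predict per request; the label alone decides the Ollama model tag."""
    label = clf.predict([prompt])[0]
    return LIGHT_MODEL if label == "light" else HEAVY_MODEL
```

The classifier object is loaded once and reused; the per-request cost is a sparse dot product, well under a millisecond on the N100.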

What actually failed

  • Sentence-BERT (all-MiniLM-L6-v2) gave 97 % accuracy but added 80 ms just to embed one sentence—longer than running the 1 B model. I rolled back.
  • Quantizing the classifier with skl2onnx crashed on an old protobuf pin required by another container. I stayed with pickle; the file is tiny anyway.
  • My first labelling pass used latency instead of quality. That routed “print hello world” to the 7 B model because the JVM was warming up. I re-labelled with cold-start times.

Numbers I measured, not copied

Metric                                     Before    After
mean first-token latency (light queries)   1.34 s    0.38 s
mean RAM at idle                           4.9 GB    1.3 GB
CPU steal during 24 h (Proxmox)            18%       4%

Power at the wall fell 4 W; the UPS now lasts 82 min instead of 71 min during outages.

What I still don’t know

I have only two weeks of post-change logs. If query patterns shift (summer holidays → more homework questions) the 1 B model will start failing. I set a cron job that retrains the classifier every Sunday at 02:00 with the last 7 days of logs. So far drift is zero, but I’ll watch it.
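The Sunday retrain can be a short script that filters the last seven days of labelled logs and refits with the same hyper-parameters. A sketch, assuming each log row carries an ISO date column — the real log schema isn’t shown here:

```python
import csv
import datetime as dt
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def retrain(log_path, out_path, days=7):
    """Refit the router classifier on the last `days` of labelled logs.
    Assumes rows shaped as date,text,label with ISO dates (my assumption)."""
    cutoff = dt.date.today() - dt.timedelta(days=days)
    texts, labels = [], []
    with open(log_path, newline="") as f:
        for row in csv.DictReader(f):
            if dt.date.fromisoformat(row["date"]) >= cutoff:
                texts.append(row["text"])
                labels.append(row["label"])
    clf = Pipeline([
        ("tfidf", TfidfVectorizer(min_df=3, ngram_range=(1, 2))),
        ("logreg", LogisticRegression(C=4.0, max_iter=1000)),
    ])
    clf.fit(texts, labels)
    with open(out_path, "wb") as f:
        pickle.dump(clf, f)
```

Writing to a temp file and renaming it over model.pkl would keep the router from ever reading a half-written pickle mid-retrain.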

Key takeaways

  1. A 200 kB scikit-learn model is enough to steer big/small LLM pairs if your categories are coarse.
  2. Measure cold-start, not warm-cache, or you’ll fool yourself.
  3. Pickle is fine on a single-user box; don’t cargo-cult ONNX if you don’t need cross-language.
  4. Keep the training pipeline in Cronicle so you actually rerun it.

The patch is three files: train.py (45 lines), router.go (+12 lines), and model.pkl (180 kB). It’s been quiet in the closet for 14 days—no complaints from the kids, no swap storms. That’s good enough for me.
