Why I wired Ollama to pick its own model
I run Ollama on a fanless N100 box in the living-room closet. The CPU has AVX-VNNI but no dGPU, so every millisecond counts when the kids yell “Hey Dad, why is the assistant so slow?” Last month I noticed that 80 % of the questions in the family chat-bot are “What’s for dinner?” or “Remind me to call Grandma.” Those fit in a 1.1 B-parameter q4_0 model that warms up in 38 ms. The other 20 % are “Explain this Rust lifetime error” or “Summarise this 30-page PDF,” jobs that drown in a 1 B model but fly with a 7 B q5_K_M.
I already had a routing layer (tiny Go service on :11435) that chooses which container gets the request. The missing piece was a dirt-simple classifier that lives beside Ollama and decides before we load weights. No cloud, no “LLM judges LLM” loops—just a 200 kB scikit-learn model that cost me one caffeinated evening to train.
My exact pipeline
1. Harvesting real queries
I enabled Ollama’s debug log (OLLAMA_DEBUG=1) and piped stderr through a small Cronicle job every night. After two weeks I had 4 812 requests. I anonymised them with a local Rust CLI that replaces names, SSIDs, file paths and git hashes with placeholders. Result: 4 812 rows, two columns only:
```
text,label
"what is for dinner tonight",light
"explain rust lifetime error in main.rs",heavy
...
```
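My actual scrubber is a Rust CLI, but the idea is just ordered regex substitutions. A Python sketch of the same idea; the patterns below are illustrative, not my exact rules (names and SSIDs need a wordlist, which I've left out):

```python
import re

# Illustrative placeholder rules, applied in order.
RULES = [
    (re.compile(r"\b[0-9a-f]{7,40}\b"), "<HASH>"),        # git hashes
    (re.compile(r"(?:/[\w.-]+){2,}"), "<PATH>"),          # unix file paths
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "<EMAIL>"),
]

def anonymise(text: str) -> str:
    """Replace identifying substrings with generic placeholders."""
    for pattern, placeholder in RULES:
        text = pattern.sub(placeholder, text)
    return text
```

The classifier never sees the placeholders as anything special; they just become ordinary (rare) tokens.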
2. Hand-labelling without going insane
I sampled 500 rows, skimmed the text, and tagged them light if a 1 B q4_0 answered acceptably in under 1 s on the N100. Anything that needed more world knowledge or longer than 1 s became heavy. That took 45 min with a tiny Tkinter UI I copy-pasted from a 2019 notebook. I did NOT label the full 4 812; 500 was enough because the decision surface is crude.
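The labelling rule itself is mechanical enough to write down. A minimal sketch encoding the criteria above (the acceptability judgement was still me eyeballing the answer):

```python
def label(cold_start_s: float, acceptable: bool) -> str:
    """'light' only if the 1 B q4_0 answer was acceptable AND
    arrived in under 1 s cold on the N100; everything else is 'heavy'."""
    return "light" if acceptable and cold_start_s < 1.0 else "heavy"
```

Note the cold-start qualifier; as described under "What actually failed", labelling on warm-cache latency gave nonsense routes.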
3. Training a model that fits in L2 cache
I tried three options:
- Logistic regression on TF-IDF → 93 % accuracy, 180 kB.
- FastText supervised → 91 %, 350 kB.
- A 3-layer CNN in Keras → 95 %, 2.1 MB.
I kept the logistic regression. The others were heavier with no user-visible gain. Hyper-parameters: min-df 3, n-gram 1-2, C 4.0, no lemmatiser (I don’t have spaCy on the router). Training time on the N100: 0.7 s. I pickled the vectoriser and coef_ to /etc/ollama-router/model.pkl (0640, ollama:ollama).
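The whole training step fits in a few lines of scikit-learn. A minimal sketch with the hyper-parameters above; the toy corpus is duplicated so every term clears min_df=3 (the real script reads the 500 labelled rows from the CSV), and I pickle the whole Pipeline here rather than the vectoriser and coef_ separately, for brevity:

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hyper-parameters from the text: min-df 3, 1-2-grams, C 4.0, no lemmatiser.
clf = Pipeline([
    ("tfidf", TfidfVectorizer(min_df=3, ngram_range=(1, 2))),
    ("lr", LogisticRegression(C=4.0, max_iter=1000)),
])

# Stand-in corpus, repeated so terms survive min_df=3;
# the real script reads the labelled CSV instead.
texts = ["what is for dinner tonight"] * 3 + ["explain rust lifetime error"] * 3
labels = ["light"] * 3 + ["heavy"] * 3
clf.fit(texts, labels)

blob = pickle.dumps(clf)  # written to /etc/ollama-router/model.pkl in prod
```

Retraining is the same script pointed at a newer CSV, which is why the Sunday cron job stays trivial.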
4. Wiring it into the routing layer
The Go service already inspects every POST to /api/chat, so the hook is one small function:

```go
func routeModel(prompt string) string {
	if modelPicker.Predict([]string{prompt})[0] == "light" {
		return "llama3.2:1b-q4_0"
	}
	return "llama3.1:7b-q5_K_M"
}
```
The first call loads the 1 B weights; the heavy model stays cold until needed. Average RAM on idle dropped from 4.9 GB to 1.3 GB, and the box no longer swaps when Plex transcodes.
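One detail the snippet glosses over is what modelPicker wraps: scikit-learn has no Go bindings, so the lowest-friction bridge is a tiny Python sidecar next to the router that the Go function queries over HTTP. This is a sketch of that idea, not the exact code on my box (the port and paths are illustrative):

```python
import pickle
from http.server import BaseHTTPRequestHandler, HTTPServer

def make_handler(clf):
    """Build a request handler that classifies POSTed prompts with clf."""
    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            body = self.rfile.read(int(self.headers["Content-Length"]))
            label = clf.predict([body.decode()])[0]  # "light" or "heavy"
            self.send_response(200)
            self.end_headers()
            self.wfile.write(label.encode())

        def log_message(self, *args):  # keep the router's stderr clean
            pass

    return Handler

def serve(model_path="/etc/ollama-router/model.pkl", port=11436):
    """Load the pickled pipeline and answer prompt-classification requests."""
    with open(model_path, "rb") as f:
        clf = pickle.load(f)
    HTTPServer(("127.0.0.1", port), make_handler(clf)).serve_forever()
```

With this in place, the Go side's Predict call is just an HTTP POST to 127.0.0.1; the round trip on loopback is well under a millisecond, so it doesn't show up in the latency table below.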
What actually failed
- Sentence-BERT (all-MiniLM-L6-v2) gave 97 % accuracy but added 80 ms just to embed one sentence—longer than running the 1 B model. I rolled back.
- Quantizing the classifier with skl2onnx crashed on an old protobuf pin required by another container. I stayed with pickle; the file is tiny anyway.
- My first labelling pass used latency instead of quality. That routed “print hello world” to the 7 B model because the JVM was warming up. I re-labelled with cold-start times.
Numbers I measured, not copied
| Metric | Before | After |
|---|---|---|
| mean first-token latency (light queries) | 1.34 s | 0.38 s |
| mean RAM at idle | 4.9 GB | 1.3 GB |
| CPU steal during 24 h (Proxmox) | 18 % | 4 % |
Power at the wall fell 4 W; the UPS now lasts 82 min instead of 71 min during outages.
What I still don’t know
I have only two weeks of post-change logs. If query patterns shift (summer holidays → more homework questions) the 1 B model will start failing. I set a cron job that retrains the classifier every Sunday at 02:00 with the last 7 days of logs. So far drift is zero, but I’ll watch it.
Key takeaways
- A 200 kB scikit model is enough to steer big/small LLM pairs if your categories are coarse.
- Measure cold-start, not warm-cache, or you’ll fool yourself.
- Pickle is fine on a single-user box; don’t cargo-cult ONNX if you don’t need cross-language portability.
- Keep the training pipeline in Cronicle so you actually rerun it.
The patch is three files: train.py (45 lines), router.go (+12 lines), and model.pkl (180 kB). It’s been quiet in the closet for 14 days—no complaints from the kids, no swap storms. That’s good enough for me.