# Running Multiple Quantization Levels of the Same Model: Dynamic VRAM Allocation in Ollama for Speed vs Quality Tradeoffs
## Why I Started Running Multiple Quantizations
I run Ollama on a Proxmox VM with an NVIDIA GPU passed through. My typical workflow involves both quick exploratory prompts and longer, quality-sensitive generation tasks. For months, I kept switching between different quantization levels of the same model—usually Llama or Mistral variants—by unloading one and loading another through the API or CLI.
This got tedious. I’d load a Q4_K_M version for speed when testing prompts or doing rough drafts, then realize I needed better output quality and have to stop, unload, and pull down a Q6_K or Q8 version. Each swap ate time and broke my flow.
I wanted both available simultaneously: a fast, lower-quality version for iteration and a slower, higher-fidelity version for final output. The constraint was VRAM. My GPU has 12GB, and loading two full models wasn’t always feasible depending on context length and concurrent requests.
## How Ollama Handles Model Loading
Ollama keeps models in memory after first use. When you send a request to a model, it loads into VRAM if not already there. If you don’t touch it for a while, Ollama unloads it to free space; this timeout is configurable via `keep_alive` and defaults to five minutes.
The key behavior I rely on: Ollama can load multiple models at once, as long as there’s enough VRAM. Each quantization of a model is treated as a separate entity. So `llama3.2:3b-instruct-q4_K_M` and `llama3.2:3b-instruct-q8_0` are independent in Ollama’s view, even though they’re the same architecture.
This means I can pre-load both and switch between them by simply changing the model name in my API calls or n8n workflows.
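Since both quantizations sit behind the same endpoint, switching is just a different `model` string in the request body. A minimal sketch of the two payloads (the helper name is mine; the model names are the ones I use):

```javascript
// Build an Ollama /api/generate request body; the only thing that
// changes between quality tiers is the model name.
function generateBody(model, prompt) {
  return { model, prompt, stream: false };
}

const fastBody = generateBody("llama3.2:3b-instruct-q4_K_M", "Summarize these logs");
const qualityBody = generateBody("llama3.2:3b-instruct-q8_0", "Draft the final copy");
```

POSTing either body to `http://localhost:11434/api/generate` hits whichever copy is already resident, so there is no load penalty.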
## My Actual Setup
I run Ollama inside a Proxmox LXC container with GPU passthrough. The container has:
- 12GB VRAM from an NVIDIA RTX 3060
- 16GB system RAM allocated
- Ollama installed via the standard Linux install script
- Environment variable `OLLAMA_KEEP_ALIVE=-1` set to keep models loaded indefinitely
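On a systemd-managed Linux install, one way to make that environment variable persistent is a drop-in override (created with `systemctl edit ollama`); the unit name and path may differ on your system:

```ini
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_KEEP_ALIVE=-1"
```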
The models I keep loaded simultaneously:
- `llama3.2:3b-instruct-q4_K_M` – roughly 2.3GB VRAM
- `llama3.2:3b-instruct-q8_0` – roughly 3.8GB VRAM
Total VRAM usage when both are idle in memory: around 6.1GB. This leaves enough headroom for context and generation without swapping.
I interact with Ollama primarily through its HTTP API, called from n8n workflows. Each workflow specifies which model to use based on the task type.
## Loading Models on Startup
Ollama doesn’t auto-load models on startup. I wrote a small bash script that runs after the Ollama service starts:
```bash
#!/bin/bash
# Warm both quantizations so they stay resident in VRAM

curl -X POST http://localhost:11434/api/generate -d '{
  "model": "llama3.2:3b-instruct-q4_K_M",
  "prompt": "test",
  "stream": false
}'

curl -X POST http://localhost:11434/api/generate -d '{
  "model": "llama3.2:3b-instruct-q8_0",
  "prompt": "test",
  "stream": false
}'
```
This sends a trivial prompt to each model, forcing Ollama to load them into VRAM. I trigger this script manually after restarts or via a systemd service dependency if I need it automated.
The script takes about 15-20 seconds to complete, depending on model size. Once done, both models stay loaded until I explicitly unload them or restart Ollama.
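For the automated case, a sketch of a oneshot unit that runs the warm-up after Ollama is up (the script path and unit names are assumptions from my layout, not a standard):

```ini
# /etc/systemd/system/ollama-preload.service
[Unit]
Description=Preload Ollama model quantizations into VRAM
After=ollama.service
Wants=ollama.service

[Service]
Type=oneshot
ExecStart=/usr/local/bin/preload-models.sh

[Install]
WantedBy=multi-user.target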
## Switching Between Quantizations in Practice
In my n8n workflows, I have a function node that sets the model name based on a simple flag:
```javascript
// n8n function node: pick a quantization based on a per-item flag
const useHighQuality = $json.requireQuality || false;
const model = useHighQuality
  ? "llama3.2:3b-instruct-q8_0"
  : "llama3.2:3b-instruct-q4_K_M";
return { model };
```
This gets passed to the HTTP request node that calls Ollama. For quick tasks like summarizing logs or generating test data, I leave `requireQuality` false. For content drafts or anything user-facing, I set it true.
The switch is instant because both models are already in VRAM. There’s no load time, no waiting for weights to transfer. The response latency difference is noticeable—Q4 generates roughly 40-50 tokens/sec on my setup, Q8 drops to around 25-30 tokens/sec—but that’s the tradeoff.
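The wall-clock impact of those rates is easy to estimate; a quick sketch for a 500-token generation using the midpoints of the ranges above (the throughput numbers are from my hardware, yours will differ):

```javascript
// Rough wall-clock estimate: tokens divided by throughput (tokens/sec)
function generationSeconds(tokens, tokensPerSec) {
  return tokens / tokensPerSec;
}

// Midpoints of the observed ranges: Q4 ~45 tok/s, Q8 ~27.5 tok/s
const q4Seconds = generationSeconds(500, 45);   // ~11.1s
const q8Seconds = generationSeconds(500, 27.5); // ~18.2s
```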
## VRAM Monitoring and Limits
I use `nvidia-smi` to check VRAM usage:
```bash
watch -n 1 nvidia-smi
```
When both models are loaded but idle, I see around 6GB used. During active generation, this spikes depending on context length. A 4k context window adds roughly 1-1.5GB per model during inference.
If I try to load a third model or use very long contexts on both simultaneously, I hit VRAM limits. Ollama doesn’t crash, but performance degrades as it starts swapping to system RAM. I can see this in `nvidia-smi` when VRAM maxes out and generation slows to a crawl.
My rule: keep total loaded model size under 8GB to leave 4GB for context and generation overhead. This has been stable for months.
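That rule is simple enough to encode; a sketch of the check I apply before loading anything new (the function is mine, not an Ollama API, and the Q6 size is an assumption consistent with the idle numbers below):

```javascript
// Check whether a set of resident models fits the VRAM budget,
// keeping a fixed reserve for context and generation overhead.
function fitsVramBudget(modelSizesGb, totalVramGb, reserveGb) {
  const loaded = modelSizesGb.reduce((sum, gb) => sum + gb, 0);
  return loaded <= totalVramGb - reserveGb;
}

// My two quantizations: 2.3GB (Q4) + 3.8GB (Q8) = 6.1GB, under 12 - 4
const twoModels = fitsVramBudget([2.3, 3.8], 12, 4);      // true
// Adding a mid-tier Q6 at roughly 3GB (assumed) blows the budget
const threeModels = fitsVramBudget([2.3, 3.8, 3.0], 12, 4); // false
```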
## What Didn’t Work
I initially tried loading three quantizations: Q4, Q6, and Q8. The idea was to have a mid-tier option. In practice, this pushed VRAM usage too high. Even at idle, I was using 9-10GB, and any real workload caused swapping.
I also tried using Ollama’s `num_gpu` parameter to offload layers to CPU for one of the models, thinking I could keep more models loaded. This technically worked, but the CPU-offloaded model became so slow it was unusable. Generation dropped to 5-8 tokens/sec, which defeated the purpose.
Another failed experiment: dynamically unloading models between requests to free VRAM. I wrote a script to call `/api/generate` with an empty prompt and `keep_alive=0` to unload a model, then reload it later. The unload worked, but reloading added 3-5 seconds of latency per request, which was worse than just keeping both loaded.
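For reference, the unload trick relies on Ollama’s per-request `keep_alive` parameter; a sketch of the eviction request body (the parameter is part of the API, the helper name is mine):

```javascript
// keep_alive: 0 tells Ollama to unload the model after handling
// the request; an empty prompt makes the request itself a no-op.
function unloadBody(model) {
  return { model: model, prompt: "", keep_alive: 0 };
}

const body = unloadBody("llama3.2:3b-instruct-q8_0");
```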
## Quality Differences I Actually Notice
For most tasks, Q4 is fine. Summaries, simple rewrites, structured data extraction—there’s no meaningful quality loss. The output is coherent and follows instructions.
Where Q8 matters:
- Longer-form writing where tone and flow matter
- Tasks requiring nuanced reasoning or multi-step logic
- Anything I’m going to publish or send to someone else
The difference isn’t dramatic, but it’s there. Q8 produces slightly more natural phrasing and fewer awkward constructions. For a 500-word article draft, I’d estimate Q8 saves me 2-3 manual edits compared to Q4.
For code generation, I haven’t noticed a significant difference between Q4 and Q8 with smaller models. Both produce similar results for straightforward tasks. Larger models or more complex code might show a gap, but I haven’t tested that extensively.
## Cost of This Approach
The main cost is VRAM. Running two quantizations means I can’t load a single larger model that might perform better overall. For example, I can’t fit a 7B Q6 model alongside these two 3B versions.
There’s also a slight complexity cost in managing which model gets used where. I have to remember to set the quality flag in my workflows, and I’ve occasionally run a high-quality task on the Q4 model by mistake.
Power usage is negligible—the GPU is already on, and idle VRAM usage doesn’t add meaningful draw.
## When This Setup Makes Sense
This approach works for me because:
- I have predictable VRAM limits and know what fits
- My workload genuinely splits between speed-focused and quality-focused tasks
- I’m using the same base model, so switching is just a name change
- I’m calling Ollama programmatically, so routing logic is easy
If you’re mostly doing one type of task, or if you have enough VRAM to run a higher quantization at acceptable speed, you probably don’t need this. Just pick one quantization and stick with it.
If you’re VRAM-constrained and can only fit one model, dynamic loading/unloading might work, but expect latency hits.
## Key Takeaways
- Ollama treats each quantization as a separate model, so you can load multiple versions of the same architecture simultaneously.
- Keeping models loaded with `OLLAMA_KEEP_ALIVE=-1` eliminates switching latency but requires enough VRAM.
- For a 12GB GPU, two 3B models (Q4 and Q8) fit comfortably with room for context.
- Quality differences between Q4 and Q8 are subtle but noticeable for writing and reasoning tasks.
- Dynamic unloading to save VRAM adds too much latency to be practical for my workflows.
- This setup makes sense if your workload genuinely splits between speed and quality needs.
I’ve been running this configuration for about four months. It’s stable, predictable, and saves me from constant model swapping. The tradeoff is VRAM allocation, but for my use case, that’s a reasonable cost.