Why I Started Looking Into This
I run Ollama on a Proxmox VM with GPU passthrough, using a single RTX 3060 with 12GB VRAM. My workflow involves switching between different models throughout the day—sometimes a 7B parameter model for quick queries, other times a 13B model for more complex reasoning tasks.
The problem started when I tried to keep multiple models loaded simultaneously. I wanted to avoid the 10-15 second reload time when switching contexts. What I got instead were silent crashes, incomplete responses that just stopped mid-sentence, and eventually full OOM (out of memory) errors that killed the Ollama service entirely.
The frustrating part wasn’t just the crashes—it was how unpredictable they were. Sometimes two 7B models would run fine. Other times, loading a second model would immediately fail. I needed to understand what was actually happening with VRAM allocation.
My Setup and Initial Assumptions
My Ollama instance runs in a Debian 12 VM with:
- RTX 3060 12GB passed through via PCIe
- 32GB system RAM allocated to the VM
- NVIDIA driver 535.154.05
- Ollama version 0.1.29 (later upgraded to 0.1.32)
I initially assumed Ollama would intelligently manage VRAM like any other resource pool—load what fits, offload what doesn’t, maybe use system RAM as overflow. That assumption was wrong.
GGUF models in Ollama don’t work like traditional applications. Each model loads its layers into VRAM based on the quantization format (Q4_K_M, Q5_K_M, etc.), and Ollama tries to keep them resident for performance. There’s no automatic swapping or graceful degradation when you hit limits.
What Actually Happens During Multi-Model Loading
I started monitoring VRAM usage with nvidia-smi dmon -s u running in a separate terminal while loading models. Here’s what I observed:
When loading the first model (Mistral 7B Q4_K_M), VRAM usage jumped to about 4.2GB. Expected. When I sent a query to a second model (Llama2 7B Q5_K_M) without unloading the first, usage climbed to 8.9GB. Still within limits.
The crash happened when I tried a third model. VRAM usage spiked briefly to 11.8GB, then the Ollama process terminated without any error in the main logs. I had to check dmesg to see the OOM killer had stepped in:
[1234567.890] Out of memory: Killed process 12345 (ollama) total-vm:45678912kB
The issue wasn’t just VRAM—it was also system RAM. Ollama loads model weights into system memory first, then copies them to VRAM. When multiple models are active, both pools fill up simultaneously.
The Hidden Overhead
What I didn’t account for initially was overhead. Each loaded model doesn’t just consume its stated size. There’s:
- Context buffer allocation (scales with context length)
- KV cache for each active conversation
- CUDA kernel overhead
- Ollama’s own management structures
A 7B model quantized to Q4_K_M might be listed as 4GB, but actually consumes 4.8-5.2GB when loaded with default context settings. Multiply that by three models and I was exceeding my 12GB VRAM before accounting for any actual inference work.
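A back-of-envelope calculation makes that overhead concrete. The ~4.85 bits/weight figure for Q4_K_M and the flat 20% markup are my assumptions rather than measured constants, but the result lands inside the range I actually observed:

```shell
#!/bin/bash
# Rough VRAM estimate for a quantized model (a sketch, not exact accounting).
# Assumptions: Q4_K_M averages ~4.85 bits per weight; a flat 20% markup
# stands in for context buffers, KV cache, and CUDA runtime overhead.
estimate_vram_gb() {
  local params_b=$1  # parameter count in billions, e.g. 7
  local bpw=$2       # average bits per weight for the quantization
  awk -v p="$params_b" -v b="$bpw" \
    'BEGIN { printf "%.1f\n", p * 1e9 * b / 8 / 1e9 * 1.2 }'
}

estimate_vram_gb 7 4.85   # prints 5.1 -- within the 4.8-5.2GB I measured
```

The file size listed by ollama list is roughly the first two factors (params × bits / 8); the 20% is everything Ollama adds on top at load time.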
What Worked: Practical Limits and Configuration
I tested different combinations systematically, logging VRAM usage for each. Here’s what actually fits on my 12GB card:
Stable configurations:
- Two 7B Q4_K_M models (uses ~9.5GB VRAM)
- One 13B Q4_K_M + one 7B Q4_K_M (uses ~11.2GB VRAM)
- One 7B Q5_K_M + one 7B Q4_K_M (uses ~10.1GB VRAM)
Configurations that crashed:
- Three 7B Q4_K_M models
- Two 13B models of any quantization
- Any combination when setting context length above 4096
The key fix was configuring Ollama’s model management. I edited the systemd service file to set environment variables:
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Environment="OLLAMA_NUM_PARALLEL=2"
This tells Ollama to never keep more than two models resident and to limit parallel requests. When a third model is requested, it automatically unloads the least recently used model first.
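On a systemd install, these variables are best kept in a drop-in file rather than the packaged unit file, so upgrades don't overwrite them. A minimal sketch, assuming the service is named ollama.service:

```ini
# /etc/systemd/system/ollama.service.d/override.conf
# Created with: sudo systemctl edit ollama
[Service]
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Environment="OLLAMA_NUM_PARALLEL=2"
```

After saving, apply it with sudo systemctl daemon-reload && sudo systemctl restart ollama.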
I also reduced the default context window in my Modelfile configurations:
PARAMETER num_ctx 2048
Going from 4096 to 2048 tokens reduced per-model VRAM usage by about 600-800MB. For most of my queries, 2048 tokens is sufficient.
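For anyone replicating this, the change is a one-line PARAMETER plus a rebuild; the model tag and the new name below are illustrative, not the exact ones I use:

```
# Modelfile: same weights, smaller context allocation
FROM mistral:7b-instruct-q4_K_M
PARAMETER num_ctx 2048
```

Build it with ollama create mistral-2k -f Modelfile, then query the new name instead of the original.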
Monitoring That Actually Helps
I wrote a simple bash script that logs VRAM usage every 30 seconds. Cron's smallest interval is one minute, so I use the usual workaround: two identical crontab entries per minute, the second prefixed with sleep 30. The script itself:
#!/bin/bash
timestamp=$(date "+%Y-%m-%d %H:%M:%S")
vram=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits)
echo "$timestamp,$vram" >> /var/log/ollama_vram.log
This gave me historical data to correlate crashes with actual memory pressure. I found that crashes consistently happened when usage exceeded 11.5GB, not at the theoretical 12GB limit. There’s some reserved memory I can’t use.
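Once the log exists, a short awk pass answers the two questions that matter: what was the peak, and how often did usage cross the danger line. The sketch below runs against synthetic data so it is self-contained; in practice, point it at /var/log/ollama_vram.log:

```shell
#!/bin/bash
# Summarize a VRAM log (timestamp,MiB per line): peak reading and the
# number of samples above a warning threshold. Data here is synthetic.
log=$(mktemp)
cat > "$log" <<'EOF'
2024-05-01 10:00:00,4210
2024-05-01 10:00:30,8930
2024-05-01 10:01:00,11812
2024-05-01 10:01:30,9512
EOF

summary=$(awk -F, -v limit=11500 '
  { if ($2 > max) max = $2; if ($2 > limit) over++ }
  END { printf "peak=%dMiB over_limit=%d", max, over }
' "$log")
echo "$summary"   # peak=11812MiB over_limit=1

rm -f "$log"
```

Setting limit to 11500 rather than 12288 reflects the reserved memory noted above: crashes correlated with crossing ~11.5GB, not the card's nominal capacity.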
What Didn’t Work
I tried several approaches that seemed logical but failed in practice:
Attempt 1: Using system RAM as overflow
I set OLLAMA_MAX_VRAM=8GB hoping Ollama would use system RAM for the remainder. Instead, inference became unusably slow (30+ seconds per token) because of constant CPU/GPU transfers. The performance hit made this pointless.
Attempt 2: Smaller quantizations
I tried Q3_K_S quantized models to save VRAM. They loaded fine, but quality dropped noticeably—more repetition, less coherent long-form responses. The memory savings (about 1GB per 7B model) weren't worth the quality loss for my use cases.
Attempt 3: Dynamic context scaling
I thought I could set different context lengths per model based on typical query types. The problem is Ollama allocates context buffers at model load time, not per request. So even if I only used 512 tokens, it still reserved space for the full configured context.
Attempt 4: Aggressive model unloading
I set OLLAMA_KEEP_ALIVE=0 to unload models immediately after each request. This prevented crashes but made the system frustrating to use. Every query had a 10-15 second startup delay. I went back to OLLAMA_KEEP_ALIVE=5m as a compromise.
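A middle ground I've since leaned on: keep_alive can also be set per request through the HTTP API rather than globally, so an occasional-use model unloads immediately while the main models stay warm under the 5m default. A sketch, assuming the default port and an illustrative model name:

```
# Per-request keep_alive via Ollama's HTTP API.
# "keep_alive": 0 unloads this model right after the response completes;
# omit it (or pass "5m") for models you want kept resident.
curl http://localhost:11434/api/generate -d '{
  "model": "mistral:7b",
  "prompt": "One-line summary of GGUF quantization.",
  "keep_alive": 0
}'
```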
The Real Bottleneck
After weeks of testing, the honest conclusion is that 12GB VRAM is enough for serious work with one model at a time, or light work with two models. Running three or more models simultaneously requires either:
- A card with 24GB+ VRAM (3090, 4090, A5000)
- Multiple GPUs with proper load balancing
- Accepting the reload delay when switching models
I chose option three. I keep two models loaded—usually a general-purpose 7B and a specialized 13B for coding tasks. When I need a third model, I accept the reload time. It’s not elegant, but it’s stable.
Key Takeaways
Measure before assuming. The stated model size is not the actual VRAM usage. Always add 15-25% overhead for context buffers and management structures.
Set hard limits. Ollama’s default behavior is to load whatever you request until something breaks. Use OLLAMA_MAX_LOADED_MODELS to enforce boundaries before hitting OOM conditions.
Context length matters more than I expected. Cutting context from 4096 to 2048 tokens saved almost as much VRAM as dropping from Q5 to Q4 quantization, with no quality impact for shorter queries.
The OOM killer is silent. Ollama doesn’t log memory pressure warnings before crashing. Monitor VRAM externally if you’re running near capacity.
There’s no free lunch with system RAM overflow. CPU-based inference is 50-100x slower than GPU. If a model doesn’t fit in VRAM, unload something else rather than trying to split it.
I still hit limits occasionally when I forget and try to load a third model. But now I understand why it fails, and I have monitoring in place to catch it before the OOM killer does. That’s enough for my workflow.