Why I Started Looking Into This
I run Ollama on a Proxmox VM with GPU passthrough, using a single RTX 3060 with 12GB VRAM. My workflow involves switching between different models throughout the day—sometimes a 7B parameter model for quick queries, other times a 13B model for more complex reasoning tasks.
The problem started when I tried to keep multiple models loaded simultaneously. I wanted to avoid the 10-15 second reload time when switching contexts. What I got instead were silent crashes, incomplete responses that just stopped mid-sentence, and eventually full OOM (out of memory) errors that killed the Ollama service entirely.
The frustrating part wasn’t just the crashes—it was how unpredictable they were. Sometimes two 7B models would run fine. Other times, loading a second model would immediately fail. I needed to understand what was actually happening with VRAM allocation.
My Setup and Initial Assumptions
My Ollama instance runs in a Debian 12 VM with:
- RTX 3060 12GB passed through via PCIe
- 32GB system RAM allocated to the VM
- NVIDIA driver 535.154.05
- Ollama version 0.1.29 (later upgraded to 0.1.32)
I initially assumed Ollama would intelligently manage VRAM like any other resource pool—load what fits, offload what doesn’t, maybe use system RAM as overflow. That assumption was wrong.
GGUF models in Ollama don’t work like traditional applications. Each model loads its layers into VRAM based on the quantization format (Q4_K_M, Q5_K_M, etc.), and Ollama tries to keep them resident for performance. There’s no automatic swapping or graceful degradation when you hit limits.
What Actually Happens During Multi-Model Loading
I started monitoring VRAM usage with nvidia-smi dmon -s u running in a separate terminal while loading models. Here’s what I observed:
When loading the first model (Mistral 7B Q4_K_M), VRAM usage jumped to about 4.2GB. Expected. When I sent a query to a second model (Llama2 7B Q5_K_M) without unloading the first, usage climbed to 8.9GB. Still within limits.
The crash happened when I tried a third model. VRAM usage spiked briefly to 11.8GB, then the Ollama process terminated without any error in the main logs. I had to check dmesg to see the OOM killer had stepped in:
[1234567.890] Out of memory: Killed process 12345 (ollama) total-vm:45678912kB
The issue wasn’t just VRAM—it was also system RAM. Ollama loads model weights into system memory first, then copies them to VRAM. When multiple models are active, both pools fill up simultaneously.
The Hidden Overhead
What I didn’t account for initially was overhead. Each loaded model doesn’t just consume its stated size. There’s:
- Context buffer allocation (scales with context length)
- KV cache for each active conversation
- CUDA kernel overhead
- Ollama’s own management structures
A 7B model quantized to Q4_K_M might be listed as 4GB, but actually consumes 4.8-5.2GB when loaded with default context settings. Multiply that by three models and I was exceeding my 12GB VRAM before accounting for any actual inference work.
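A back-of-envelope calculation makes that overhead concrete. The ~4.85 bits/weight figure for Q4_K_M and the flat 20% markup are my assumptions rather than measured constants, but the result lands inside the range I actually observed:

```shell
#!/bin/bash
# Rough VRAM estimate for a quantized model (a sketch, not exact accounting).
# Assumptions: Q4_K_M averages ~4.85 bits per weight; a flat 20% markup
# stands in for context buffers, KV cache, and CUDA runtime overhead.
estimate_vram_gb() {
  local params_b=$1  # parameter count in billions, e.g. 7
  local bpw=$2       # average bits per weight for the quantization
  awk -v p="$params_b" -v b="$bpw" \
    'BEGIN { printf "%.1f\n", p * 1e9 * b / 8 / 1e9 * 1.2 }'
}

estimate_vram_gb 7 4.85   # prints 5.1 -- within the 4.8-5.2GB I measured
```

The file size listed by ollama list is roughly the first two factors (params × bits / 8); the 20% is everything Ollama adds on top at load time.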
What Worked: Practical Limits and Configuration
I tested different combinations systematically, logging VRAM usage for each. Here’s what actually fits on my 12GB card:
Stable configurations:
- Two 7B Q4_K_M models (uses ~9.5GB VRAM)
- One 13B Q4_K_M + one 7B Q4_K_M (uses ~11.2GB VRAM)
- One 7B Q5_K_M + one 7B Q4_K_M (uses ~10.1GB VRAM)
Configurations that crashed:
- Three 7B Q4_K_M models
- Two 13B models of any quantization
- Any combination when setting context length above 4096
The key fix was configuring Ollama’s model management. I edited the systemd service file to set environment variables:
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Environment="OLLAMA_NUM_PARALLEL=2"
This tells Ollama to never keep more than two models resident and to limit parallel requests. When a third model is requested, it automatically unloads the least recently used model first.
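On a systemd install, these variables are best kept in a drop-in file rather than the packaged unit file, so upgrades don't overwrite them. A minimal sketch, assuming the service is named ollama.service:

```ini
# /etc/systemd/system/ollama.service.d/override.conf
# Created with: sudo systemctl edit ollama
[Service]
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Environment="OLLAMA_NUM_PARALLEL=2"
```

After saving, apply it with sudo systemctl daemon-reload && sudo systemctl restart ollama.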
I also reduced the default context window in my Modelfile configurations:
PARAMETER num_ctx 2048
Going from 4096 to 2048 tokens reduced per-model VRAM usage by about 600-800MB. For most of my queries, 2048 tokens is sufficient.
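For anyone replicating this, the change is a one-line PARAMETER plus a rebuild; the model tag and the new name below are illustrative, not the exact ones I use:

```
# Modelfile: same weights, smaller context allocation
FROM mistral:7b-instruct-q4_K_M
PARAMETER num_ctx 2048
```

Build it with ollama create mistral-2k -f Modelfile, then query the new name instead of the original.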
Monitoring That Actually Helps
I wrote a simple bash script that logs VRAM usage every 30 seconds. Cron's smallest interval is one minute, so I use the usual workaround: two identical crontab entries per minute, the second prefixed with sleep 30. The script itself:
#!/bin/bash
timestamp=$(date "+%Y-%m-%d %H:%M:%S")
vram=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits)
echo "$timestamp,$vram" >> /var/log/ollama_vram.log
This gave me historical data to correlate crashes with actual memory pressure. I found that crashes consistently happened when usage exceeded 11.5GB, not at the theoretical 12GB limit. There’s some reserved memory I can’t use.
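Once the log exists, a short awk pass answers the two questions that matter: what was the peak, and how often did usage cross the danger line. The sketch below runs against synthetic data so it is self-contained; in practice, point it at /var/log/ollama_vram.log:

```shell
#!/bin/bash
# Summarize a VRAM log (timestamp,MiB per line): peak reading and the
# number of samples above a warning threshold. Data here is synthetic.
log=$(mktemp)
cat > "$log" <<'EOF'
2024-05-01 10:00:00,4210
2024-05-01 10:00:30,8930
2024-05-01 10:01:00,11812
2024-05-01 10:01:30,9512
EOF

summary=$(awk -F, -v limit=11500 '
  { if ($2 > max) max = $2; if ($2 > limit) over++ }
  END { printf "peak=%dMiB over_limit=%d", max, over }
' "$log")
echo "$summary"   # peak=11812MiB over_limit=1

rm -f "$log"
```

Setting limit to 11500 rather than 12288 reflects the reserved memory noted above: crashes correlated with crossing ~11.5GB, not the card's nominal capacity.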
What Didn’t Work
I tried several approaches that seemed logical but failed in practice:
Attempt 1: Using system RAM as overflow
I set OLLAMA_MAX_VRAM=8GB hoping Ollama would use system RAM for the remainder. Instead, inference became unusably slow (30+ seconds per token) because of constant CPU/GPU transfers. The performance hit made this pointless.
Attempt 2: Smaller quantizations
I tried Q3_K_S quantized models to save VRAM. They loaded fine, but quality dropped noticeably—more repetition, less coherent long-form responses. The memory savings (about 1GB per 7B model) weren't worth the quality loss for my use cases.
Attempt 3: Dynamic context scaling
I thought I could set different context lengths per model based on typical query types. The problem is Ollama allocates context buffers at model load time, not per request. So even if I only used 512 tokens, it still reserved space for the full configured context.
Attempt 4: Aggressive model unloading
I set OLLAMA_KEEP_ALIVE=0 to unload models immediately after each request. This prevented crashes but made the system frustrating to use. Every query had a 10-15 second startup delay. I went back to OLLAMA_KEEP_ALIVE=5m as a compromise.
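A middle ground I've since leaned on: keep_alive can also be set per request through the HTTP API rather than globally, so an occasional-use model unloads immediately while the main models stay warm under the 5m default. A sketch, assuming the default port and an illustrative model name:

```
# Per-request keep_alive via Ollama's HTTP API.
# "keep_alive": 0 unloads this model right after the response completes;
# omit it (or pass "5m") for models you want kept resident.
curl http://localhost:11434/api/generate -d '{
  "model": "mistral:7b",
  "prompt": "One-line summary of GGUF quantization.",
  "keep_alive": 0
}'
```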
The Real Bottleneck
After weeks of testing, the honest conclusion is that 12GB VRAM is enough for serious work with one model at a time, or light work with two models. Running three or more models simultaneously requires either:
- A card with 24GB+ VRAM (3090, 4090, A5000)
- Multiple GPUs with proper load balancing
- Accepting the reload delay when switching models
I chose option three. I keep two models loaded—usually a general-purpose 7B and a specialized 13B for coding tasks. When I need a third model, I accept the reload time. It’s not elegant, but it’s stable.
Key Takeaways
Measure before assuming. The stated model size is not the actual VRAM usage. Always add 15-25% overhead for context buffers and management structures.
Set hard limits. Ollama’s default behavior is to load whatever you request until something breaks. Use OLLAMA_MAX_LOADED_MODELS to enforce boundaries before hitting OOM conditions.
Context length matters more than I expected. Cutting context from 4096 to 2048 tokens saved almost as much VRAM as dropping from Q5 to Q4 quantization, with no quality impact for shorter queries.
The OOM killer is silent. Ollama doesn’t log memory pressure warnings before crashing. Monitor VRAM externally if you’re running near capacity.
There’s no free lunch with system RAM overflow. CPU-based inference is 50-100x slower than GPU. If a model doesn’t fit in VRAM, unload something else rather than trying to split it.
I still hit limits occasionally when I forget and try to load a third model. But now I understand why it fails, and I have monitoring in place to catch it before the OOM killer does. That’s enough for my workflow.