Why I Worked on This
I run Ollama on Proxmox with two different GPUs: an RTX 2060 (6GB VRAM) and an RTX 3090 (24GB VRAM). For months, this worked fine. Then after an Ollama update, my setup started crashing with CUDA out-of-memory errors. Models that previously loaded across both cards suddenly couldn't fit, even though the total VRAM hadn't changed.
This wasn't theoretical. My n8n workflows that called Ollama started failing mid-execution. I needed to understand what changed and how to work around it.
My Real Setup
I'm running:
- Proxmox 8.x host with GPU passthrough
- Ubuntu 22.04 LXC container with both GPUs passed through
- NVIDIA driver 545.23.08, CUDA 12.3
- Ollama installed directly in the container (not Docker)
- Two GPUs visible via CUDA_VISIBLE_DEVICES
Before the issue, I was running Ollama 0.1.18. The model I tested with was a Q4_K quantized Mixtral variant (about 24GB total size). Ollama would split layers across both GPUs, using most of the 3090's memory and a chunk of the 2060's. It worked.
What Broke
After updating to Ollama 0.1.20, the same model crashed immediately with:
CUDA error 2 at ggml-cuda.cu:9007: out of memory
current device: 1
The logs showed Ollama trying to offload 31 of 33 layers and claiming it needed 24.2GB of VRAM. It was trying to split that allocation roughly evenly across both cards, which works out to about 12GB per GPU: no problem for the 3090, but double what the 6GB 2060 physically has.
Even more confusing: changing the GPU order with CUDA_VISIBLE_DEVICES=1,0 (putting the 3090 first) also crashed. In version 0.1.18, order mattered. In 0.1.20, neither order worked.
What I Tried
Forcing a Single GPU
I set CUDA_VISIBLE_DEVICES=1 to use only the 3090. This worked but defeated the purpose of having two GPUs. The 2060 sat idle.
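For reference, this is roughly what that looks like in practice. The device index is specific to my host and the model name is a stand-in; the useful part is confirming which card actually fills up:
nvidia-smi -L                          # list both cards and their indices
CUDA_VISIBLE_DEVICES=1 ollama serve    # expose only one card (index 1 is the 3090 here)
# from a second shell, load a model and confirm only the 3090's memory climbs
ollama run mixtral "hello"
nvidia-smi --query-gpu=index,name,memory.used --format=csv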
Checking Memory Allocation
I ran nvidia-smi during load. With both GPUs enabled, Ollama was trying to split memory roughly equally:
RTX 2060: 5719 MiB used / 6144 MiB total (93% full)
RTX 3090: 20389 MiB used / 24576 MiB total (83% full)
The 2060 was maxed out with only 206 MiB free. Any context growth or additional model layers pushed it over the edge.
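If you want to watch this live while a model loads, plain nvidia-smi works, but a query like this trims the output to the columns that matter:
watch -n 2 nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv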
Lowering Layer Offload
I tried setting num_gpu in the Modelfile to manually control how many layers went to the GPU. This didn't help because Ollama was still trying to allocate memory across both cards, and the smaller card couldn't hold its share.
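For anyone who hasn't used it, this is the shape of what I tried. The model name and layer count are placeholders, not a recommendation:
cat > Modelfile <<'EOF'
FROM mixtral
PARAMETER num_gpu 20
EOF
ollama create mixtral-limited -f Modelfile   # num_gpu caps how many layers get offloaded
ollama run mixtral-limited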
Reverting to 0.1.18
I downgraded to the last working version. It loaded the model successfully but still crashed occasionally under heavy use or large context windows. The 2060 was always the bottleneck.
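If you need to pin a version the same way, the official install script honored an OLLAMA_VERSION variable when I last checked; verify against the current docs before relying on it:
curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.1.18 sh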
What Actually Worked
I ended up with two solutions depending on the use case:
Single GPU for Stability
For production workflows (n8n, Cronicle jobs), I run Ollama with only the 3090:
CUDA_VISIBLE_DEVICES=1 ollama serve
This is stable and predictable. The 2060 isn't used, but I don't have crashes.
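If Ollama runs as a systemd service (the default when installed with the official script), the same setting can be pinned in the unit instead of on the command line. A sketch, assuming the default ollama.service unit name:
sudo systemctl edit ollama
# in the override that opens, add:
# [Service]
# Environment="CUDA_VISIBLE_DEVICES=1"
sudo systemctl restart ollama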
Dual GPU with Smaller Models
For testing or smaller models (under 15GB total), I still use both GPUs with Ollama 0.1.18. I keep the 3090 first with CUDA_VISIBLE_DEVICES=1,0 and monitor memory usage with nvidia-smi. If the 2060's memory usage climbs past 90%, I know I'm close to a crash.
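The monitoring can be as simple as a loop like this. A rough sketch: GPU index 0 is the 2060 on my host, and the 90% threshold is just the point where I've seen crashes start:
while sleep 5; do
  nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits -i 0 |
    awk -F', ' '$1/$2 > 0.9 { printf "WARN: 2060 at %d%% VRAM\n", $1/$2*100 }'
done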
What Didn't Work
Trying to use both GPUs with newer Ollama versions (0.1.19+) and large models. The memory scheduler assumes GPUs are similar in size and splits work evenly. This breaks with asymmetric cards.
I also tried adjusting context length and batch size via Ollama's API, thinking I could reduce memory pressure. It didn't matter. The initial model load itself was the problem.
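For the record, this is the kind of request I mean; num_ctx and num_batch are per-request options in Ollama's generate API, and the model and prompt here are stand-ins:
curl http://localhost:11434/api/generate -d '{
  "model": "mixtral",
  "prompt": "summarize this ticket",
  "options": { "num_ctx": 2048, "num_batch": 128 }
}'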
Why This Happens
From what I can tell, Ollama's multi-GPU support changed between 0.1.18 and 0.1.20. The older version had a simpler scheduler that just filled up GPUs in order until it ran out of space. The newer version tries to balance load, which makes sense for identical GPUs but fails when one card is much smaller.
The error message is accurate: Ollama is asking CUDA to allocate more memory than the 2060 has. It's not a bug in the traditional sense—it's a limitation of how the scheduler works with mixed GPU sizes.
Key Takeaways
If you're running multiple GPUs with different VRAM sizes:
- Newer Ollama versions (0.1.19+) don't handle asymmetric GPUs well
- Using a single GPU is more stable than fighting the scheduler
- If you need both GPUs, stick with older versions and monitor memory closely
- GPU order matters in some versions but not others—test both configurations
- The CUDA error happens at model load time, not during inference, so you'll know immediately if it's going to crash
I still use the 2060 for other tasks (local Stable Diffusion, video transcoding), but for Ollama specifically, I've accepted that the 3090 alone is more reliable than trying to use both cards together.
Current State
I'm running Ollama 0.1.18 in a separate LXC container for dual-GPU experiments and a newer version with single-GPU config for production. It's not elegant, but it works. I check the Ollama GitHub issues occasionally to see if asymmetric GPU support improves, but as of my last update, it's still a known limitation.
If you're setting up Ollama on Proxmox with mixed GPUs, save yourself the debugging time: start with the larger GPU only, confirm stability, then decide if the smaller GPU is worth the added complexity.