Why I’m Running Multiple LLMs at Once
I run several LLM workflows on my home server. Some handle code generation, others process documentation, and a few manage conversational tasks through n8n automations. The problem: these workflows don’t all need the same model, and they don’t run at predictable times.
Loading a different model each time a workflow triggers means waiting 10-30 seconds for VRAM allocation and model initialization. When three workflows fire within minutes of each other, that delay compounds. I needed multiple models ready in memory simultaneously.
My RTX 5090 has 32GB of VRAM. On paper, that’s enough for several 7B-13B models. In practice, I hit memory thrashing within days of my first attempt.
My Actual Hardware Setup
The server runs Proxmox with a dedicated Ubuntu 24.04 VM that has GPU passthrough configured. The RTX 5090 is the only GPU in the system, passed through completely to this VM.
I’m running Ollama 0.5.1 with CUDA 12.4. The VM has 64GB of system RAM, though Ollama primarily uses VRAM for model weights.
My typical model roster:
- Qwen2.5-Coder 7B (code generation)
- Llama 3.1 8B (general tasks)
- Mistral 7B v0.3 (fast responses)
- Deepseek-Coder 6.7B (code review)
All models use Q4_K_M quantization. That’s the sweet spot I found between quality and memory usage for my workflows.
What Didn’t Work: Naive Parallel Loading
My first approach was simple: keep all four models loaded by calling them in sequence at startup. Ollama’s default behavior keeps models in VRAM until memory pressure forces eviction.
This failed within 48 hours. Memory usage would climb to 28-29GB, then the system would start thrashing. Models would unload and reload unpredictably. Inference times spiked from 2 seconds to 30+ seconds as models fought for VRAM.
The problem wasn’t total memory—it was fragmentation and Ollama’s eviction policy. When a fifth model request came in (sometimes my workflows would retry with a different model), Ollama would try to load it, fail to find contiguous space, then start evicting models seemingly at random.
I also tried setting OLLAMA_MAX_LOADED_MODELS=4. This prevented the fifth model from loading but didn’t solve fragmentation. Models still unloaded under memory pressure I couldn’t predict.
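For anyone repeating this: the variable has to be set on the Ollama server process, not in your interactive shell. On a systemd-managed install that means a drop-in, roughly like this (the drop-in path is the standard one, but treat it as an assumption about your layout):

```ini
# /etc/systemd/system/ollama.service.d/override.conf
# (create with: sudo systemctl edit ollama, then daemon-reload and restart)
[Service]
Environment="OLLAMA_MAX_LOADED_MODELS=4"
# Keep models resident instead of unloading after the default five minutes
Environment="OLLAMA_KEEP_ALIVE=24h"
```

OLLAMA_KEEP_ALIVE matters here too: without it, idle models are unloaded after a few minutes regardless of the max-loaded setting.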
Understanding VRAM Allocation in Ollama
I spent time watching nvidia-smi output during different operations. Here’s what actually happens:
When Ollama loads a model, it allocates VRAM in chunks. A 7B Q4_K_M model needs roughly 4-5GB depending on context length. But Ollama also reserves overhead—about 1-2GB per model for KV cache and processing buffers.
The KV cache size grows with context length. At 2048 tokens, it’s manageable. At 8192 tokens, it can add another 1-2GB per model. My workflows were using default context lengths, which varied by model.
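Those KV cache numbers are easy to sanity-check. Per token, the cache stores a key and a value vector for every layer, so the cost is 2 × layers × KV heads × head dim × bytes per element. A back-of-the-envelope for Llama 3.1 8B, using its published config (32 layers, 8 KV heads, head dim 128, fp16 cache; those are assumptions on my part, not something Ollama reports):

```shell
# Back-of-the-envelope KV cache size for Llama 3.1 8B (assumed config values)
layers=32; kv_heads=8; head_dim=128; bytes_per_elem=2    # fp16 cache
per_token=$((2 * layers * kv_heads * head_dim * bytes_per_elem))  # K and V
echo "per token: ${per_token} bytes"                     # 131072 = 128 KiB
for ctx in 2048 8192; do
  echo "at ${ctx} tokens: $((per_token * ctx / 1024 / 1024)) MiB"
done
# prints 256 MiB at 2048 and 1024 MiB at 8192, consistent with the
# extra 1-2GB I was seeing once buffers and batching sit on top
```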
Ollama’s eviction policy is LRU-based but doesn’t account for model reload cost. It would evict a 13B model to make room for a 7B model, even though reloading the 13B model later would be more expensive.
The Actual Memory Breakdown
For my four-model setup, real VRAM usage looked like this:
- Model weights: ~18GB total (4-5GB each)
- KV cache overhead: ~6GB (1.5GB per model at 4096 context)
- CUDA overhead: ~2GB
- Operating headroom: ~2GB
That’s 28GB under ideal conditions. Any context length spike or fifth model request would trigger thrashing.
What Actually Worked: Fixed Context Windows and Explicit Limits
I made three changes that stabilized the system:
1. Fixed Context Length Per Model
I set explicit context lengths in my n8n workflows instead of using defaults. Code generation gets 4096 tokens. Documentation processing gets 8192. Quick responses get 2048.
This made KV cache allocation predictable. No more surprise memory spikes when a workflow requested maximum context.
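Concretely, each workflow's HTTP request node pins num_ctx through the options field of Ollama's generate API. Something like this (the model and prompt here are placeholders):

```shell
# Pin the context window per request instead of relying on model defaults
payload='{"model": "qwen2.5-coder:7b",
          "prompt": "Write a function that parses a CSV line.",
          "stream": false,
          "options": {"num_ctx": 4096}}'

# Assumes Ollama on its default port; fall through with a message if not
curl -s -X POST http://localhost:11434/api/generate -d "$payload" \
  || echo "Ollama not reachable"
```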
2. Model-Specific VRAM Budgets
I calculated actual VRAM needs per model by loading each one individually and monitoring nvidia-smi. Then I documented those numbers and designed workflows around them.
- Qwen2.5-Coder at 4096 context: 5.2GB
- Llama 3.1 at 4096 context: 5.8GB
- Mistral at 2048 context: 4.1GB
- Deepseek-Coder at 4096 context: 4.8GB
Total: 19.9GB for weights and KV cache, plus 2GB CUDA overhead. That left 10GB of headroom.
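The measurement procedure itself is mundane: snapshot VRAM, force the model into memory with a minimal request, snapshot again, diff. A sketch of what I mean, assuming the 5090 is GPU 0 and Ollama is on its default port:

```shell
# Measure what one model load actually costs in VRAM (GPU 0 assumed)
gpu_mem_mb() {
  nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -i 0
}
vram_delta_mb() {   # usage: vram_delta_mb BEFORE AFTER
  echo $(( $2 - $1 ))
}

before=$(gpu_mem_mb 2>/dev/null || echo 0)
# A tiny generate request is enough to force the weights into VRAM
curl -s -X POST http://localhost:11434/api/generate \
  -d '{"model": "qwen2.5-coder:7b", "prompt": "hi", "stream": false}' \
  > /dev/null || true
after=$(gpu_mem_mb 2>/dev/null || echo 0)
echo "qwen2.5-coder:7b cost: $(vram_delta_mb "$before" "$after") MiB"
```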
3. Startup Sequence with Verification
I wrote a bash script that loads models in sequence and verifies each one before continuing. It sends a small test prompt to each model and checks response time. If any model takes longer than 5 seconds to respond, the script stops and alerts me.
#!/bin/bash
set -euo pipefail

models=("qwen2.5-coder:7b" "llama3.1:8b" "mistral:7b" "deepseek-coder:6.7b")

for model in "${models[@]}"; do
  echo "Loading $model..."
  start=$(date +%s)
  # JSON quotes must be escaped inside the double-quoted payload
  response=$(curl -s -X POST http://localhost:11434/api/generate \
    -d "{\"model\": \"$model\", \"prompt\": \"test\", \"stream\": false}")
  end=$(date +%s)
  duration=$((end - start))
  if [ -z "$response" ]; then
    echo "ERROR: $model returned an empty response"
    exit 1
  fi
  if [ "$duration" -gt 5 ]; then
    echo "ERROR: $model took ${duration}s to respond"
    exit 1
  fi
  echo "$model loaded successfully in ${duration}s"
  sleep 2
done

echo "All models loaded and verified"
This runs on VM startup via systemd. If it fails, I get a notification through n8n before any workflows try to use the models.
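The unit is a plain oneshot ordered after the Ollama service. Roughly this, where the service name and script path are assumptions about my setup rather than anything standard:

```ini
# /etc/systemd/system/ollama-warmup.service (sketch)
[Unit]
Description=Preload and verify Ollama models
# Only start once the Ollama server itself is up
After=ollama.service
Requires=ollama.service

[Service]
Type=oneshot
ExecStart=/usr/local/bin/ollama-warmup.sh
# Give cold loads time to finish before systemd gives up
TimeoutStartSec=300

[Install]
WantedBy=multi-user.target
```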
Monitoring That Actually Helped
I added a Cronicle job that runs every 5 minutes and logs VRAM usage per model. It parses nvidia-smi output and writes to a CSV file that I can graph later.
The key metric isn’t total VRAM usage—it’s fragmentation. If I see models reloading frequently (detected by sudden drops and spikes in per-process memory), something is wrong with my workflow design.
I also track inference time per model. If Qwen2.5-Coder suddenly takes 8 seconds instead of 2 seconds for the same prompt, that’s usually a sign of memory pressure even if total VRAM usage looks fine.
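The Cronicle job is a thin wrapper over nvidia-smi's per-process query. A trimmed sketch (the CSV path is illustrative):

```shell
# Append a timestamped per-process VRAM snapshot to a CSV
log=./vram-usage.csv           # illustrative path; the real job sets its own
ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)

nvidia-smi --query-compute-apps=pid,process_name,used_memory \
           --format=csv,noheader,nounits 2>/dev/null |
while IFS=', ' read -r pid name mem; do
  echo "${ts},${pid},${name},${mem}" >> "$log"
done
```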
What I Still Can’t Solve
Ollama doesn’t expose fine-grained control over KV cache allocation. I can’t tell it “reserve exactly 1.5GB for this model’s KV cache and no more.” It manages that internally.
This means I have to leave more headroom than I’d like. With tighter control, I could probably fit a fifth model. As it stands, four is the reliable limit.
I also can’t prioritize which model stays in memory during pressure. If I could tell Ollama “never evict Qwen2.5-Coder unless absolutely necessary,” my code generation workflows would be more consistent. Right now, eviction is purely LRU.
Practical Limits I’ve Found
32GB VRAM can handle four 7B-class models at 4096 context with about 8GB of headroom. That headroom is necessary—not for normal operation, but for the occasional workflow that needs 8192 context or retries with a different model.
If I wanted to run larger models (13B+), I’d need to drop to three models maximum. A 13B Q4_K_M model uses 8-9GB with KV cache at 4096 context.
Smaller quantizations (Q3_K_M) save 15-20% VRAM but produce noticeably worse output for code generation. I tested this extensively. The quality loss isn’t worth the memory savings for my use case.
Key Takeaways
Fixed context lengths are non-negotiable. Variable context creates unpredictable memory usage that will eventually cause thrashing.
Measure actual VRAM usage per model under real workflow conditions. Don’t trust theoretical calculations or model card estimates.
Leave 20-25% VRAM as headroom. Ollama’s memory management isn’t perfect, and workflows don’t always behave as designed.
Monitor inference time, not just memory usage. Slow responses are often the first sign of memory pressure.
Four 7B-class models is the practical limit on 32GB VRAM with current Ollama versions. You can push to five if you’re willing to accept occasional reloads.