
Debugging CUDA Out-of-Memory Errors in Ollama Multi-Model Deployments: Memory Pooling Strategies for 24GB VRAM Limits

Why I Started Debugging CUDA Memory Errors

I run a Proxmox home server with an RTX 4090 passed through to a dedicated VM for local AI workloads. When I first tried running multiple Ollama models simultaneously—switching between a code assistant, a writing helper, and a general-purpose LLM—I hit CUDA out-of-memory errors constantly. The 24GB VRAM seemed generous until I realized each model was claiming memory and never fully releasing it.

This wasn’t about running one giant 70B model. It was about running three or four smaller models (7B to 13B) in rotation throughout the day without restarting the entire Ollama service every time I switched tasks. The memory pooling behavior was eating my VRAM alive.

My Actual Setup

Here’s what I’m working with:

  • Proxmox VE 8.1 host
  • Ubuntu 22.04 VM with GPU passthrough (RTX 4090, 24GB VRAM)
  • Ollama 0.1.x running as a systemd service
  • CUDA 12.2 drivers
  • Typical workload: llama2:13b-q4_K_M, codellama:7b-instruct, mistral:7b-instruct

I monitor GPU usage with nvidia-smi in a tmux pane while working. What I noticed: even after switching models, the previous model’s memory allocation would linger. Three model switches later, I’d be at 22GB used with only one model supposedly active.

What Actually Causes the Memory Leak

Ollama doesn’t immediately free GPU memory when you stop using a model. It keeps models in VRAM for faster reloading—a reasonable optimization for single-model workflows, but disastrous for multi-model rotation.

When I ran:

ollama run llama2:13b-q4_K_M
# Use it for 10 minutes
# Exit the session
nvidia-smi

The memory stayed allocated. Running a second model would stack on top of the first. By the third model, I’d hit the ceiling.

The error looked like this:

CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call

Not helpful. No indication of which model was hogging memory or how to force cleanup.
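For visibility into which models are actually resident, newer Ollama builds expose a /api/ps endpoint that lists loaded models. My install predates the `ollama ps` CLI wrapper, but the HTTP endpoint is worth probing. A sketch, assuming your version has the endpoint (the `loaded_models` name is mine; the grep/cut parsing is a crude substitute for jq):

```shell
# Assumption: your Ollama version exposes /api/ps (documented in newer releases).
loaded_models() {
    curl -s http://localhost:11434/api/ps |
        grep -o '"name":"[^"]*"' |
        cut -d'"' -f4
}
# Usage: loaded_models    # prints one resident model name per line
```

If the endpoint 404s on your build, nvidia-smi's process list is the fallback, though it only shows the Ollama server process, not which models it holds.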

Failed Approach: Environment Variables

I tried setting OLLAMA_GPU_LAYERS and OLLAMA_MAX_CONTEXT hoping to limit per-model memory usage:

export OLLAMA_GPU_LAYERS=20
export OLLAMA_MAX_CONTEXT=2048
ollama run codellama:7b-instruct

This reduced initial memory allocation slightly but didn’t solve the pooling problem. Old models still hung around in VRAM. The environment variables control how much a model uses, not when it gets evicted.
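One variable that does target eviction, for what it's worth, is OLLAMA_KEEP_ALIVE, which newer Ollama releases document as a service-wide default for how long models stay resident. I can't vouch for it on every 0.1.x build, so treat this systemd override as a sketch rather than a guaranteed fix:

```
# Assumption: your Ollama build honors OLLAMA_KEEP_ALIVE (documented in newer releases)
sudo systemctl edit ollama.service
# In the override file, add:
#   [Service]
#   Environment="OLLAMA_KEEP_ALIVE=0"
sudo systemctl restart ollama
```

A value of 0 would make every model unload immediately after each request, which trades reload latency for predictable VRAM. The per-request API approach below gives finer control.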

What Actually Worked: Explicit Model Unloading

Ollama, at least the 0.1.x release I'm running, doesn't have a built-in "unload model" command in the CLI. But it does respond to the API. I wrote a simple bash function to force an unload:

unload_model() {
    curl -X POST http://localhost:11434/api/generate \
         -H "Content-Type: application/json" \
         -d '{"model": "'"$1"'", "keep_alive": 0}'
}

# Usage:
unload_model "llama2:13b-q4_K_M"

Setting keep_alive: 0 tells Ollama to immediately release the model from memory. This actually freed VRAM. I verified with nvidia-smi—memory dropped from 14GB to 2GB after unloading.
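To make that before/after check repeatable instead of eyeballing nvidia-smi, the comparison can be wrapped in a helper. A small sketch (the `report_freed` name and the 2-second settle delay are my own choices, not anything Ollama defines):

```shell
# Report how much VRAM a command (e.g. an unload) actually freed.
vram_used_mib() {
    nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -n 1
}

report_freed() {
    local before after
    before=$(vram_used_mib)
    "$@"                # run whatever was passed in, e.g. the unload call
    sleep 2             # give Ollama a moment to release the allocation
    after=$(vram_used_mib)
    echo "Freed $((before - after)) MiB"
}
# Usage: report_freed unload_model "llama2:13b-q4_K_M"
```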

Building a Model Rotation Script

Since I switch models multiple times a day, I automated the unload process. Here’s my actual script (not theoretical—I use this daily):

#!/bin/bash
# ~/bin/ollama-switch

CURRENT_MODEL_FILE="/tmp/ollama_current_model"

if [ -z "$1" ]; then
    echo "Usage: ollama-switch <model>" >&2
    exit 1
fi

# Unload previous model if exists
if [ -f "$CURRENT_MODEL_FILE" ]; then
    PREV_MODEL=$(cat "$CURRENT_MODEL_FILE")
    echo "Unloading previous model: $PREV_MODEL"
    curl -s -X POST http://localhost:11434/api/generate \
         -H "Content-Type: application/json" \
         -d '{"model": "'"$PREV_MODEL"'", "keep_alive": 0}' > /dev/null
fi

# Load new model
NEW_MODEL=$1
echo "Loading model: $NEW_MODEL"
echo "$NEW_MODEL" > "$CURRENT_MODEL_FILE"

ollama run "$NEW_MODEL"

I call this with:

ollama-switch llama2:13b-q4_K_M
# Later...
ollama-switch codellama:7b-instruct

Each switch unloads the previous model before loading the next. No more memory accumulation.
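One refinement I'd consider if the unload ever lags: poll VRAM until it actually drops before loading the next model, rather than trusting the release to be instantaneous. A sketch (the `wait_for_vram` helper and its 15-try cap are my own conventions, not an Ollama feature):

```shell
# Block until used VRAM falls to or below target_mib, or give up after 15 tries.
wait_for_vram() {
    local target_mib=$1 tries=0 used
    while [ "$tries" -lt 15 ]; do
        used=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -n 1)
        [ "$used" -le "$target_mib" ] && return 0
        sleep 1
        tries=$((tries + 1))
    done
    return 1
}
# In ollama-switch, between the unload curl and `ollama run`:
# wait_for_vram 4000 || echo "warning: VRAM still above 4000 MiB" >&2
```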

Monitoring Memory Usage Properly

I added a persistent nvidia-smi monitor in tmux:

watch -n 2 'nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu --format=csv,noheader,nounits'

This shows me in real-time:

  • Current VRAM usage
  • Total available
  • GPU utilization percentage

When I switch models with my script, I see memory drop immediately. Without the script, it just climbs.

Quantization Trade-offs I Actually Encountered

I tested different quantization levels to see if I could fit more models in memory simultaneously:

  • q4_K_M: 7B model uses ~4.5GB, 13B uses ~8GB. Quality is acceptable for most tasks.
  • q5_K_M: 7B model uses ~5.5GB, 13B uses ~10GB. Slightly better responses, not always noticeable.
  • q8_0: 7B model uses ~7GB. I couldn’t tell the difference from q5_K_M in my use cases.

For multi-model rotation, q4_K_M is the sweet spot. I can theoretically keep 5 models loaded if I don’t unload between switches, but that defeats the purpose. Better to use 2-3 models with proper unloading than to cram 5 and hope.
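Those footprints make the budgeting simple arithmetic. A throwaway helper using my measured q4_K_M numbers (the function name and the idea of reserving roughly 2GB of headroom for context/KV cache are my own conventions):

```shell
# Check whether a set of model footprints (GB) fits a VRAM budget (GB).
fits_in_vram() {
    local budget_gb=$1 total=0 size
    shift
    for size in "$@"; do
        total=$((total + size))
    done
    if [ "$total" -le "$budget_gb" ]; then
        echo "fits: ${total}GB of ${budget_gb}GB"
    else
        echo "over budget: ${total}GB of ${budget_gb}GB"
    fi
}
# 13B + two 7B at q4_K_M, against 24GB minus ~2GB headroom:
# fits_in_vram 22 8 5 5
```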

CPU Offloading: When I Actually Use It

I don’t use CPU offloading for my regular workflow. The performance drop is too severe—from ~60 tokens/second on pure GPU to ~8 tokens/second with mixed CPU/GPU. But I do use it for one specific case: running a background embedding model while using a main chat model.

export OLLAMA_GPU_LAYERS=10
ollama run nomic-embed-text &  # Runs in background with partial GPU
ollama run llama2:13b-q4_K_M   # Main model gets most VRAM

This keeps the embedding model responsive enough for quick lookups without hogging all VRAM. If I didn’t need the embedding model active, I wouldn’t bother with CPU offloading.
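When the embedding model is up, I query it over the API rather than the interactive CLI. A sketch using Ollama's /api/embeddings endpoint (the endpoint is documented; the `embed` wrapper name is mine, and note it splices $1 into the JSON unescaped, which is fine only for simple text):

```shell
# Get an embedding vector for a piece of text from the background model.
# Caveat: $1 is inserted into the JSON body without escaping.
embed() {
    curl -s http://localhost:11434/api/embeddings \
         -H "Content-Type: application/json" \
         -d '{"model": "nomic-embed-text", "prompt": "'"$1"'"}'
}
# Usage: embed "text to index"    # returns JSON like {"embedding":[...]}
```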

What Doesn’t Work: Modelfile Parameter Tuning

I tried creating custom Modelfiles with reduced context windows and batch sizes:

FROM llama2:13b-q4_K_M
PARAMETER num_ctx 1024
PARAMETER num_batch 256
PARAMETER num_gpu 20

This reduced memory usage by maybe 1-2GB. Not worth the effort for my use case. The real problem was memory pooling, not per-model allocation. Tuning parameters didn’t address that.
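For completeness: a Modelfile only takes effect once you build a named variant from it with `ollama create`. The variant name below is arbitrary, wrapped in a trivial helper:

```shell
# Build a trimmed variant from a Modelfile so `ollama run` can use it by name.
build_variant() {
    ollama create "$1" -f "$2"
}
# Usage:
# build_variant llama2-lowmem ./Modelfile
# ollama run llama2-lowmem
```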

Multi-GPU Setup: I Don’t Have One

I only have one RTX 4090. If I had two, I’d probably dedicate one to persistent models and one to rotating models. But I haven’t tested this. Most advice online about multi-GPU Ollama setups is theoretical or marketing material for expensive hardware.

Key Takeaways

  • Ollama’s memory pooling is aggressive. Models don’t auto-unload.
  • Explicit unloading via API (keep_alive: 0) is the only reliable way to free VRAM.
  • Environment variables control per-model limits, not eviction behavior.
  • Quantization helps, but q4_K_M is sufficient for most work—going higher doesn’t justify the memory cost.
  • CPU offloading is too slow for primary models but okay for background tasks.
  • Monitoring with nvidia-smi is essential—you can’t manage what you can’t measure.

If you’re running multiple models on a single GPU, build an unload mechanism. Don’t rely on Ollama to manage memory for you—it won’t.
