Why I Started Looking Into This
I run Ollama on my Proxmox server with a passthrough NVIDIA GPU. It handles various automation tasks through n8n—summarizing content, processing text, answering queries. Most of these workloads are small and infrequent, but they run 24/7.
After a few days of uptime, I noticed response times degrading. What normally took 2-3 seconds would suddenly spike to 10-15 seconds. Restarting the Ollama container fixed it immediately, which pointed to something accumulating over time rather than a configuration problem.
I initially assumed it was a model caching issue or some quirk in how Ollama manages concurrent requests. Turns out, it was VRAM fragmentation combined with how Ollama keeps models loaded in memory.
My Setup
The relevant parts:
- Proxmox host with an NVIDIA RTX 3060 (12GB VRAM) passed through to an LXC container
- Ollama running in Docker inside that container
- Models I use regularly: llama3.2 (3B), mistral (7B), and occasionally deepseek-coder
- n8n workflows triggering inference requests at irregular intervals—sometimes back-to-back, sometimes hours apart
- No load balancer, no model offloading to CPU, everything stays on GPU
I don’t run a high-throughput setup. This is personal infrastructure for automation, not production ML serving.
What I Observed
The first sign was inconsistent performance. A workflow that reliably completed in under 5 seconds would randomly take 20+ seconds. Checking nvidia-smi showed VRAM usage sitting around 8-9GB even when idle, which seemed high for models that should only need 4-6GB combined.
I added basic monitoring using a shell script that logged VRAM usage every minute:
#!/bin/bash
while true; do
  nvidia-smi --query-gpu=timestamp,memory.used,memory.free --format=csv,noheader,nounits >> /var/log/vram_usage.log
  sleep 60
done
Over 48 hours, I saw a pattern: VRAM usage would climb gradually, plateau, then stay there. Even after workflows finished and no inference was happening, memory didn’t release. The Ollama API reported models as “loaded” but not actively processing anything.
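To turn that log into numbers instead of eyeballing it, a small awk helper works. This is a sketch that assumes the exact CSV layout the logging loop writes (timestamp, memory.used in MiB, memory.free in MiB); the function name is mine.

```shell
# Summarize the VRAM log produced by the logging loop.
# Assumes CSV rows of: timestamp, memory.used [MiB], memory.free [MiB]
vram_stats() {
  awk -F', ' '
    { used = $2 + 0 }                    # second field: memory.used in MiB
    NR == 1 || used < min { min = used } # track minimum
    used > max            { max = used } # track maximum
    { sum += used; n++ }
    END {
      if (n) printf "samples=%d min=%dMiB max=%dMiB avg=%dMiB\n", n, min, max, sum / n
    }
  ' "$1"
}

# Example: vram_stats /var/log/vram_usage.log
```

Running this once a day makes the gradual climb obvious: min stays near baseline while max and avg drift upward.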
When I manually unloaded models using ollama stop <model>, VRAM dropped immediately. That confirmed the issue wasn’t a leak in the traditional sense—Ollama was holding onto models as designed, but fragmentation or allocation overhead was causing the bloat.
Understanding Ollama’s Memory Behavior
Ollama keeps models in VRAM after use to speed up subsequent requests. This makes sense for interactive use, but my workload pattern—sporadic requests across different models—meant I was accumulating loaded models without ever fully cycling them out.
There’s an OLLAMA_KEEP_ALIVE environment variable that controls how long models stay loaded after a request. The default is 5 minutes. I tried setting it to 1 minute:
OLLAMA_KEEP_ALIVE=1m
This helped slightly, but didn’t solve the core problem. Models were unloading faster, but VRAM fragmentation still built up over days. The real issue was that even after unloading, memory wasn’t being reclaimed cleanly.
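For context, the variable has to be set on the container itself, not in the workflow. A minimal sketch of how that looks with docker run (the container name and volume name are from my setup, not required values):

```shell
# Illustrative; adjust names and paths to your environment
docker run -d \
  --name ollama \
  --gpus all \
  -p 11434:11434 \
  -v ollama_data:/root/.ollama \
  -e OLLAMA_KEEP_ALIVE=1m \
  ollama/ollama
```

If you use docker-compose instead, the same value goes under the service's environment section.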
Attempted Fixes That Didn’t Work
I tried a few things that seemed logical but didn’t pan out:
- Lowering context window size: Reduced num_ctx from 4096 to 2048 in model parameters. Helped with initial load size but didn’t stop the gradual climb.
- Switching to smaller quantized models: Moved from Q5 to Q4 quants. Reduced baseline usage but fragmentation still accumulated.
- Manually calling garbage collection: Tried triggering Python’s gc.collect() in workflows. Ollama is written in Go, so this did nothing.
None of these addressed the root cause. The only reliable fix was restarting the Ollama service.
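For completeness, here’s how the num_ctx change was applied per request. Ollama’s /api/generate endpoint accepts model parameters under an options object; the prompt below is just a placeholder:

```shell
curl -X POST http://ollama:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Summarize the following text: ...",
  "options": { "num_ctx": 2048 }
}'
```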
What Actually Worked
I implemented two changes:
1. Periodic Ollama Restarts
I added a systemd timer (running on the host, not in the container) to restart the Ollama Docker container every 24 hours during a low-usage window:
# /etc/systemd/system/ollama-restart.service
[Unit]
Description=Restart Ollama container
[Service]
Type=oneshot
ExecStart=/usr/bin/docker restart ollama
# /etc/systemd/system/ollama-restart.timer
[Unit]
Description=Restart Ollama daily at 3 AM
[Timer]
OnCalendar=*-*-* 03:00:00
Persistent=true
[Install]
WantedBy=timers.target
Enable it with:
systemctl daemon-reload
systemctl enable --now ollama-restart.timer
This brute-force approach works. VRAM usage resets to baseline, and performance stays consistent. It’s not elegant, but it’s reliable.
2. Proactive Model Unloading in Workflows
In n8n, I added a final step to workflows that explicitly unloads the model after inference completes:
curl -X POST http://ollama:11434/api/generate -d '{
"model": "llama3.2",
"keep_alive": 0
}'
Setting keep_alive to 0 forces immediate unload. This reduced the number of models sitting idle in VRAM between workflow runs.
I also set a global OLLAMA_MAX_LOADED_MODELS=1 in the Docker environment. This ensures only one model stays in memory at a time, forcing older ones out when a new one loads.
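To verify that unloading actually happened, Ollama exposes a list of currently loaded models (the same information the ollama ps command shows). An empty models array means nothing is resident in VRAM:

```shell
# List models currently loaded into memory
curl http://ollama:11434/api/ps
```

I spot-checked this after workflow runs to confirm the keep_alive: 0 step was taking effect.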
Monitoring VRAM Over Time
I wanted better visibility than manually checking nvidia-smi, so I set up a simple Prometheus exporter using nvidia_gpu_exporter:
docker run -d \
  --name=nvidia_exporter \
  --gpus all \
  -p 9835:9835 \
  mindprince/nvidia_gpu_prometheus_exporter:latest
Then added it as a scrape target in my existing Prometheus instance and built a Grafana dashboard tracking:
- VRAM used vs. free over time
- GPU utilization spikes (correlates with inference requests)
- Temperature (to catch thermal throttling, which I’ve seen before)
This gave me a clear view of memory patterns. I could see exactly when fragmentation started building and confirm that restarts brought it back down.
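The core panel is a simple ratio. The metric names below are what I’d expect from this exporter, but check http://localhost:9835/metrics for the exact names your version exposes:

```
# VRAM used as a fraction of total, per GPU
nvidia_gpu_memory_used_bytes / nvidia_gpu_memory_total_bytes
```

Graphed over a week, the sawtooth from the daily restarts is clearly visible.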
What I Learned About VRAM Fragmentation
VRAM fragmentation isn’t the same as a traditional memory leak. The memory is “used” from the GPU’s perspective, but it’s not efficiently packed. When Ollama loads and unloads models repeatedly, especially different sizes, gaps form in the allocation space.
CUDA doesn’t defragment automatically the way some OS-level memory managers do. Once fragmentation sets in, the only way to clear it is to reset the CUDA context—which effectively means restarting the process using the GPU.
I confirmed this by watching nvidia-smi during a restart. Memory usage would drop from 9GB to 1GB instantly, even though no models were “leaked” in the traditional sense.
Current State
With daily restarts and aggressive model unloading, the system is stable. VRAM usage stays under 6GB during normal operation, and response times are consistent.
I still see minor fragmentation build up over the 24-hour window, but it never reaches the point where performance degrades noticeably. The restart cycle keeps it in check.
I’d prefer a solution that doesn’t require restarts, but I haven’t found one that works reliably with my usage pattern. If Ollama adds better memory management or a manual defragmentation API in the future, I’ll revisit this.
Key Takeaways
- VRAM fragmentation is real in long-running GPU workloads, especially with variable model sizes
- Ollama’s default behavior of keeping models loaded is great for interactive use but problematic for sporadic automation
- Setting OLLAMA_KEEP_ALIVE low and OLLAMA_MAX_LOADED_MODELS=1 helps but doesn’t eliminate the issue
- Scheduled restarts are a practical workaround when you control the infrastructure
- Monitoring VRAM usage over time is essential—problems aren’t obvious until you track them
This isn’t a problem with Ollama specifically. Any system that loads and unloads GPU workloads dynamically will hit this eventually. The solution is either architectural (better memory management in the tool) or operational (restart cycles). I went with operational because it works and I can automate it.