Why I Started Looking Into This
I run Ollama on my Proxmox server with a passthrough NVIDIA GPU. It handles various automation tasks through n8n—summarizing content, processing text, answering queries. Most of these workloads are small and infrequent, but they run 24/7.
After a few days of uptime, I noticed response times degrading. What normally took 2-3 seconds would suddenly spike to 10-15 seconds. Restarting the Ollama container fixed it immediately, which pointed to something accumulating over time rather than a configuration problem.
I initially assumed it was a model caching issue or some quirk in how Ollama manages concurrent requests. Turns out, it was VRAM fragmentation combined with how Ollama keeps models loaded in memory.
My Setup
The relevant parts:
- Proxmox host with an NVIDIA RTX 3060 (12GB VRAM) passed through to an LXC container
- Ollama running in Docker inside that container
- Models I use regularly: llama3.2 (3B), mistral (7B), and occasionally deepseek-coder
- n8n workflows triggering inference requests at irregular intervals—sometimes back-to-back, sometimes hours apart
- No load balancer, no model offloading to CPU, everything stays on GPU
I don’t run a high-throughput setup. This is personal infrastructure for automation, not production ML serving.
What I Observed
The first sign was inconsistent performance. A workflow that reliably completed in under 5 seconds would randomly take 20+ seconds. Checking nvidia-smi showed VRAM usage sitting around 8-9GB even when idle, which seemed high for models that should only need 4-6GB combined.
I added basic monitoring using a shell script that logged VRAM usage every minute:
#!/bin/bash
while true; do
  nvidia-smi --query-gpu=timestamp,memory.used,memory.free --format=csv,noheader,nounits >> /var/log/vram_usage.log
  sleep 60
done
Over 48 hours, I saw a pattern: VRAM usage would climb gradually, plateau, then stay there. Even after workflows finished and no inference was happening, memory didn’t release. The Ollama API reported models as “loaded” but not actively processing anything.
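To turn that log into numbers instead of eyeballing it, a small awk helper works. This is a sketch that assumes the exact CSV layout the logging loop writes (timestamp, memory.used in MiB, memory.free in MiB); the function name is mine.

```shell
# Summarize the VRAM log produced by the logging loop.
# Assumes CSV rows of: timestamp, memory.used [MiB], memory.free [MiB]
vram_stats() {
  awk -F', ' '
    { used = $2 + 0 }                    # second field: memory.used in MiB
    NR == 1 || used < min { min = used } # track minimum
    used > max            { max = used } # track maximum
    { sum += used; n++ }
    END {
      if (n) printf "samples=%d min=%dMiB max=%dMiB avg=%dMiB\n", n, min, max, sum / n
    }
  ' "$1"
}

# Example: vram_stats /var/log/vram_usage.log
```

Running this once a day makes the gradual climb obvious: min stays near baseline while max and avg drift upward.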
When I manually unloaded models using ollama stop <model>, VRAM dropped immediately. That confirmed the issue wasn’t a leak in the traditional sense—Ollama was holding onto models as designed, but fragmentation or allocation overhead was causing the bloat.
Understanding Ollama’s Memory Behavior
Ollama keeps models in VRAM after use to speed up subsequent requests. This makes sense for interactive use, but my workload pattern—sporadic requests across different models—meant I was accumulating loaded models without ever fully cycling them out.
There’s an OLLAMA_KEEP_ALIVE environment variable that controls how long models stay loaded after a request. The default is 5 minutes. I tried setting it to 1 minute:
OLLAMA_KEEP_ALIVE=1m
This helped slightly, but didn’t solve the core problem. Models were unloading faster, but VRAM fragmentation still built up over days. The real issue was that even after unloading, memory wasn’t being reclaimed cleanly.
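For context, the variable has to be set on the container itself, not in the workflow. A minimal sketch of how that looks with docker run (the container name and volume name are from my setup, not required values):

```shell
# Illustrative; adjust names and paths to your environment
docker run -d \
  --name ollama \
  --gpus all \
  -p 11434:11434 \
  -v ollama_data:/root/.ollama \
  -e OLLAMA_KEEP_ALIVE=1m \
  ollama/ollama
```

If you use docker-compose instead, the same value goes under the service's environment section.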
Attempted Fixes That Didn’t Work
I tried a few things that seemed logical but didn’t pan out:
- Lowering context window size: Reduced num_ctx from 4096 to 2048 in model parameters. Helped with initial load size but didn’t stop the gradual climb.
- Switching to smaller quantized models: Moved from Q5 to Q4 quants. Reduced baseline usage but fragmentation still accumulated.
- Manually calling garbage collection: Tried triggering Python’s gc.collect() in workflows. Ollama is written in Go, so this did nothing.
None of these addressed the root cause. The only reliable fix was restarting the Ollama service.
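For completeness, here’s how the num_ctx change was applied per request. Ollama’s /api/generate endpoint accepts model parameters under an options object; the prompt below is just a placeholder:

```shell
curl -X POST http://ollama:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Summarize the following text: ...",
  "options": { "num_ctx": 2048 }
}'
```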
What Actually Worked
I implemented two changes:
1. Periodic Ollama Restarts
I added a systemd timer (running on the host, not in the container) to restart the Ollama Docker container every 24 hours during a low-usage window:
# /etc/systemd/system/ollama-restart.service
[Unit]
Description=Restart Ollama container
[Service]
Type=oneshot
ExecStart=/usr/bin/docker restart ollama
# /etc/systemd/system/ollama-restart.timer
[Unit]
Description=Restart Ollama daily at 3 AM
[Timer]
OnCalendar=*-*-* 03:00:00
Persistent=true
[Install]
WantedBy=timers.target
Enable it with:
systemctl daemon-reload
systemctl enable --now ollama-restart.timer
This brute-force approach works. VRAM usage resets to baseline, and performance stays consistent. It’s not elegant, but it’s reliable.
2. Proactive Model Unloading in Workflows
In n8n, I added a final step to workflows that explicitly unloads the model after inference completes:
curl -X POST http://ollama:11434/api/generate -d '{
"model": "llama3.2",
"keep_alive": 0
}'
Setting keep_alive to 0 forces immediate unload. This reduced the number of models sitting idle in VRAM between workflow runs.
I also set a global OLLAMA_MAX_LOADED_MODELS=1 in the Docker environment. This ensures only one model stays in memory at a time, forcing older ones out when a new one loads.
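To verify that unloading actually happened, Ollama exposes a list of currently loaded models (the same information the ollama ps command shows). An empty models array means nothing is resident in VRAM:

```shell
# List models currently loaded into memory
curl http://ollama:11434/api/ps
```

I spot-checked this after workflow runs to confirm the keep_alive: 0 step was taking effect.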
Monitoring VRAM Over Time
I wanted better visibility than manually checking nvidia-smi, so I set up a simple Prometheus exporter using nvidia_gpu_exporter:
docker run -d \
  --name=nvidia_exporter \
  --gpus all \
  -p 9835:9835 \
  mindprince/nvidia_gpu_prometheus_exporter:latest
Then added it as a scrape target in my existing Prometheus instance and built a Grafana dashboard tracking:
- VRAM used vs. free over time
- GPU utilization spikes (correlates with inference requests)
- Temperature (to catch thermal throttling, which I’ve seen before)
This gave me a clear view of memory patterns. I could see exactly when fragmentation started building and confirm that restarts brought it back down.
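The core panel is a simple ratio. The metric names below are what I’d expect from this exporter, but check http://localhost:9835/metrics for the exact names your version exposes:

```
# VRAM used as a fraction of total, per GPU
nvidia_gpu_memory_used_bytes / nvidia_gpu_memory_total_bytes
```

Graphed over a week, the sawtooth from the daily restarts is clearly visible.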
What I Learned About VRAM Fragmentation
VRAM fragmentation isn’t the same as a traditional memory leak. The memory is “used” from the GPU’s perspective, but it’s not efficiently packed. When Ollama loads and unloads models repeatedly, especially different sizes, gaps form in the allocation space.
CUDA doesn’t defragment automatically the way some OS-level memory managers do. Once fragmentation sets in, the only way to clear it is to reset the CUDA context—which effectively means restarting the process using the GPU.
I confirmed this by watching nvidia-smi during a restart. Memory usage would drop from 9GB to 1GB instantly, even though no models were “leaked” in the traditional sense.
Current State
With daily restarts and aggressive model unloading, the system is stable. VRAM usage stays under 6GB during normal operation, and response times are consistent.
I still see minor fragmentation build up over the 24-hour window, but it never reaches the point where performance degrades noticeably. The restart cycle keeps it in check.
I’d prefer a solution that doesn’t require restarts, but I haven’t found one that works reliably with my usage pattern. If Ollama adds better memory management or a manual defragmentation API in the future, I’ll revisit this.
Key Takeaways
- VRAM fragmentation is real in long-running GPU workloads, especially with variable model sizes
- Ollama’s default behavior of keeping models loaded is great for interactive use but problematic for sporadic automation
- Setting OLLAMA_KEEP_ALIVE low and OLLAMA_MAX_LOADED_MODELS=1 helps but doesn’t eliminate the issue
- Scheduled restarts are a practical workaround when you control the infrastructure
- Monitoring VRAM usage over time is essential—problems aren’t obvious until you track them
This isn’t a problem with Ollama specifically. Any system that loads and unloads GPU workloads dynamically will hit this eventually. The solution is either architectural (better memory management in the tool) or operational (restart cycles). I went with operational because it works and I can automate it.