Why I Worked on This
I run multiple n8n workflows that call Ollama for various tasks — summarizing documents, extracting structured data, and generating reports. Everything worked fine when I tested workflows individually. But the moment I started running three or four workflows concurrently, Ollama would choke. Requests timed out, some returned empty responses, and n8n nodes failed silently.
I needed to understand what was actually happening at the API level and fix it without just throwing more hardware at the problem.
My Real Setup
I run Ollama on a Proxmox VM with:
- 16GB RAM allocated
- No GPU passthrough (CPU inference only)
- Models: llama3.2:3b and mistral:7b
- n8n running in a separate Docker container on the same Proxmox host
The workflows call Ollama's /api/generate endpoint directly using n8n's HTTP Request node. No fancy libraries, no middleware — just plain HTTP POST requests with JSON payloads.
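For reference, each request looks roughly like this (the hostname is a placeholder for my VM's address; the endpoint and fields are Ollama's standard /api/generate interface):
# "ollama-vm" stands in for the VM's actual IP; 11434 is Ollama's default port
curl -s http://ollama-vm:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:3b", "prompt": "Summarize this document: ...", "stream": false}'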
What I Observed
Initial Symptoms
When I triggered multiple workflows at once, I saw:
- HTTP 503 errors from Ollama
- Some requests hanging for 60+ seconds before timing out
- Ollama's logs showing "connection reset by peer" messages
- CPU spiking to 100% on the Ollama VM
I initially thought it was a model loading issue. But even with OLLAMA_KEEP_ALIVE set to keep models in memory, the same failures happened.
Actual Bottlenecks
I ran htop on the Ollama VM and watched what happened during concurrent requests. CPU maxed out immediately. RAM usage was fine — around 8GB used. The problem wasn't memory. It was CPU contention and Ollama's default concurrency limits.
I checked Ollama's environment variables. By default, it only handles a few concurrent requests before queuing or rejecting new ones. My workflows were sending 5-10 requests simultaneously, overwhelming that limit.
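If you want to confirm what the service is actually running with, the unit file and the live log are the quickest checks (this assumes Ollama is installed as a systemd service, as it is on my VM):
# Show the unit file plus any drop-in overrides, including Environment= lines
systemctl cat ollama
# Follow the server log while concurrent requests come in
journalctl -u ollama -f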
What Worked
1. Setting OLLAMA_NUM_PARALLEL
I added this to the Ollama systemd service file:
Environment="OLLAMA_NUM_PARALLEL=4"
Then restarted the service:
sudo systemctl daemon-reload
sudo systemctl restart ollama
This allowed Ollama to process up to 4 requests concurrently instead of queuing everything. It didn't eliminate failures, but reduced them significantly.
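A systemd drop-in override works too and keeps the main unit file untouched; this is a sketch of that approach (the OLLAMA_KEEP_ALIVE value is just an example, not something I tuned carefully):
# Standard systemd drop-in location for the ollama service
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf <<'EOF'
[Service]
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_KEEP_ALIVE=30m"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama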
2. Limiting n8n Workflow Concurrency
Instead of letting n8n fire off unlimited parallel requests, I configured n8n to queue executions so each workflow waits its turn before calling Ollama.
In n8n, I set:
- Execution mode: Queue
- Max concurrent executions: 2
This meant only two workflows could call Ollama at the same time. The rest waited in line. Not elegant, but it worked.
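For a Docker-based n8n, those settings end up as environment variables on the container. Treat the sketch below as a starting point only: the exact variable names can differ between n8n versions, and full queue mode also needs Redis and worker containers, so verify against the n8n docs before copying it.
# Hypothetical flags -- check the exact variable names for your n8n version
docker run -d --name n8n \
  -e EXECUTIONS_MODE=queue \
  -e QUEUE_BULL_REDIS_HOST=redis \
  -e N8N_CONCURRENCY_PRODUCTION_LIMIT=2 \
  -p 5678:5678 \
  n8nio/n8n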
3. Adding Retry Logic in n8n
I configured the HTTP Request node to retry failed requests:
- Retry on fail: Yes
- Max retries: 3
- Wait between retries: 5 seconds
This handled transient failures when Ollama was briefly overloaded. Some requests succeeded on the second or third attempt.
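The same retry-and-wait pattern is easy to reproduce from a shell when testing outside n8n; this loop is only an illustration, not part of my workflows, and "ollama-vm" is a placeholder:
# Try up to 3 times, waiting 5 seconds between failed attempts
for attempt in 1 2 3; do
  if curl -sf http://ollama-vm:11434/api/generate \
       -d '{"model": "llama3.2:3b", "prompt": "Test", "stream": false}'; then
    break
  fi
  echo "Attempt $attempt failed, retrying in 5s..." >&2
  sleep 5
done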
4. Reducing Context Window
I created a custom Modelfile for llama3.2 with a smaller context window:
FROM llama3.2:3b
PARAMETER num_ctx 1024
Then created the model:
ollama create llama3.2-fast -f ./Modelfile
This reduced memory usage per request and sped up inference slightly. Not a huge difference, but noticeable when running multiple workflows.
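The context window can also be set per request instead of baking it into a Modelfile; /api/generate accepts an options object, so an occasional workflow that needs a different window can override it (hostname is again a placeholder):
# Same effect as the Modelfile, but scoped to this one request
curl -s http://ollama-vm:11434/api/generate \
  -d '{"model": "llama3.2:3b", "prompt": "Summarize: ...", "stream": false, "options": {"num_ctx": 1024}}'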
What Didn't Work
Increasing OLLAMA_NUM_PARALLEL Too High
I tried setting OLLAMA_NUM_PARALLEL=10, thinking more concurrency would help. It made things worse. CPU usage stayed at 100%, and requests took even longer because the VM was thrashing between too many inference tasks.
The sweet spot for my 16GB VM was 3-4 concurrent requests. Beyond that, performance degraded.
Using Streaming Responses
I thought streaming ("stream": true) would reduce load by sending partial responses. It didn't help. n8n doesn't handle streaming well in HTTP Request nodes, and I still needed to wait for the full response anyway.
Streaming might work if I built a custom n8n node or used a different client, but for my use case, it added complexity without benefit.
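For context, this is what streaming looks like at the HTTP level: with "stream": true, Ollama returns one JSON object per line, ending with a final object marked "done": true, which is the part the stock HTTP Request node doesn't handle cleanly:
# Prints a series of newline-delimited JSON chunks instead of one response
curl -s http://ollama-vm:11434/api/generate \
  -d '{"model": "llama3.2:3b", "prompt": "Test", "stream": true}'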
Connection Pooling in n8n
I tried enabling persistent connections in n8n's HTTP Request node settings. It had no noticeable effect. Ollama doesn't seem to benefit much from connection reuse when requests are CPU-bound, not network-bound.
How I Monitor It Now
I set up a simple monitoring workflow in n8n that pings Ollama every 5 minutes with a lightweight request:
{
  "model": "llama3.2-fast",
  "prompt": "Test",
  "stream": false
}
It logs response time to a Google Sheet. If response time exceeds 10 seconds, I get a notification. This gives me early warning before workflows start failing.
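The same check is easy to run by hand when debugging; this sketch times one lightweight request using the values from my setup (hostname is a placeholder):
# Rough response-time check against the same lightweight model
start=$(date +%s)
curl -s http://ollama-vm:11434/api/generate \
  -d '{"model": "llama3.2-fast", "prompt": "Test", "stream": false}' > /dev/null
echo "Response time: $(( $(date +%s) - start ))s"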
I also check Ollama's logs occasionally:
journalctl -u ollama -f
I'm looking for errors or warnings about connection limits.
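When I don't want to tail the log live, filtering recent history works just as well (the grep pattern is only an example of what to filter for):
# Last day of Ollama logs, reduced to the lines that usually matter
journalctl -u ollama --since "1 day ago" --no-pager | grep -iE "error|timed out|connection reset"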
Key Takeaways
- Ollama's default concurrency settings are conservative. Adjust OLLAMA_NUM_PARALLEL based on your hardware, but don't overdo it.
- Rate limiting isn't always about network throttling. In my case, it was CPU saturation.
- Queuing workflows in n8n is more reliable than letting them all fire at once.
- Smaller context windows reduce memory and CPU load per request.
- Retry logic in n8n helps handle transient failures without manual intervention.
- Monitoring response times gives you early warning before things break completely.
Current Limitations
This setup works for my current load (5-10 workflows per hour). If I need to scale beyond that, I'll either need to add GPU passthrough to the VM or run multiple Ollama instances behind a load balancer. Neither is trivial in my Proxmox environment.
I also haven't tested this with larger models like llama3:70b. Those would require more RAM and likely hit different bottlenecks.
The monitoring workflow only checks basic availability. It doesn't track per-model performance or detect gradual degradation. That would require more sophisticated logging, which I haven't built yet.