
Debugging Ollama API rate limiting when running multiple concurrent inference requests from n8n automation workflows

Why I Worked on This

I run multiple n8n workflows that call Ollama for various tasks — summarizing documents, extracting structured data, and generating reports. Everything worked fine when I tested workflows individually. But the moment I started running three or four workflows concurrently, Ollama would choke. Requests timed out, some returned empty responses, and n8n nodes failed silently.

I needed to understand what was actually happening at the API level and fix it without just throwing more hardware at the problem.

My Real Setup

I run Ollama on a Proxmox VM with:

  • 16GB RAM allocated
  • No GPU passthrough (CPU inference only)
  • Models: llama3.2:3b and mistral:7b
  • n8n running in a separate Docker container on the same Proxmox host

The workflows call Ollama's /api/generate endpoint directly using n8n's HTTP Request node. No fancy libraries, no middleware — just plain HTTP POST requests with JSON payloads.
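
For reference, each of those calls boils down to a single POST like this (host and prompt are placeholders, not my actual values):

curl http://<ollama-vm-ip>:11434/api/generate \
  -d '{
    "model": "llama3.2:3b",
    "prompt": "Summarize the following document: ...",
    "stream": false
  }'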

What I Observed

Initial Symptoms

When I triggered multiple workflows at once, I saw:

  • HTTP 503 errors from Ollama
  • Some requests hanging for 60+ seconds before timing out
  • Ollama's logs showing "connection reset by peer" messages
  • CPU spiking to 100% on the Ollama VM

I initially thought it was a model loading issue. But even with OLLAMA_KEEP_ALIVE set to keep models in memory, the same failures happened.
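
For context, that variable is set in Ollama's systemd service file; a value of -1 keeps models loaded indefinitely (the value shown is an example, not necessarily what I use):

Environment="OLLAMA_KEEP_ALIVE=-1"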

Actual Bottlenecks

I ran htop on the Ollama VM and watched what happened during concurrent requests. CPU maxed out immediately. RAM usage was fine — around 8GB used. The problem wasn't memory. It was CPU contention and Ollama's default concurrency limits.

I checked Ollama's environment variables. By default, Ollama only processes a few requests per model in parallel and queues the rest; once that queue fills up, new requests are rejected. My workflows were sending 5-10 requests simultaneously, overwhelming that limit.
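
If you want to see what the service is actually running with, systemd can print both the unit file (including any drop-ins) and the resolved environment:

systemctl cat ollama
systemctl show ollama --property=Environment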

What Worked

1. Setting OLLAMA_NUM_PARALLEL

I added this to the Ollama systemd service file:

Environment="OLLAMA_NUM_PARALLEL=4"

Then restarted the service:

sudo systemctl daemon-reload
sudo systemctl restart ollama

This allowed Ollama to process up to 4 requests concurrently instead of queuing everything. It didn't eliminate failures, but reduced them significantly.
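
The same change also works as a systemd drop-in instead of an edit to the unit file itself. Running sudo systemctl edit ollama opens an override file, and these two lines go into it:

[Service]
Environment="OLLAMA_NUM_PARALLEL=4"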

2. Limiting n8n Workflow Concurrency

Instead of letting n8n fire off unlimited parallel requests, I added a "Queue" node at the start of each workflow. This forces workflows to wait their turn before calling Ollama.

In n8n, I set:

  • Execution mode: Queue
  • Max concurrent executions: 2

This meant only two workflows could call Ollama at the same time. The rest waited in line. Not elegant, but it worked.
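
For reference, here's roughly how queue mode with a capped worker is wired up when n8n runs in Docker. This is a sketch based on n8n's docs, not my exact compose file; a real setup also needs a shared database and a common N8N_ENCRYPTION_KEY across the containers:

services:
  redis:
    image: redis:7
  n8n:
    image: docker.n8n.io/n8nio/n8n
    environment:
      - EXECUTIONS_MODE=queue
      - QUEUE_BULL_REDIS_HOST=redis
    ports:
      - "5678:5678"
  n8n-worker:
    image: docker.n8n.io/n8nio/n8n
    command: worker --concurrency=2
    environment:
      - EXECUTIONS_MODE=queue
      - QUEUE_BULL_REDIS_HOST=redis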

3. Adding Retry Logic in n8n

I configured the HTTP Request node to retry failed requests:

  • Retry on fail: Yes
  • Max retries: 3
  • Wait between retries: 5 seconds

This handled transient failures when Ollama was briefly overloaded. Some requests succeeded on the second or third attempt.
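
The same idea translates directly to curl if you want to test retries outside n8n; its built-in retry flags cover the 503s and timeouts (host is a placeholder):

curl --retry 3 --retry-delay 5 \
  http://<ollama-vm-ip>:11434/api/generate \
  -d '{"model": "llama3.2:3b", "prompt": "Test", "stream": false}'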

4. Reducing Context Window

I created a custom Modelfile for llama3.2 with a smaller context window:

FROM llama3.2:3b
PARAMETER num_ctx 1024

Then created the model:

ollama create llama3.2-fast -f ./Modelfile

This reduced memory usage per request and sped up inference slightly. Not a huge difference, but noticeable when running multiple workflows.
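
A quick way to confirm the parameter actually took is ollama show, which prints the model's details, including its parameters:

ollama show llama3.2-fast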

What Didn't Work

Setting OLLAMA_NUM_PARALLEL Too High

I tried setting OLLAMA_NUM_PARALLEL=10, thinking more concurrency would help. It made things worse. CPU usage stayed at 100%, and requests took even longer because the VM was thrashing, constantly switching between too many inference tasks.

The sweet spot for my 16GB VM was 3-4 concurrent requests. Beyond that, performance degraded.

Using Streaming Responses

I thought streaming ("stream": true) would reduce load by sending partial responses. It didn't help. n8n doesn't handle streaming well in HTTP Request nodes, and I still needed to wait for the full response anyway.

Streaming might work if I built a custom n8n node or used a different client, but for my use case, it added complexity without benefit.
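
For the record, the only change on the request side is the stream flag; Ollama then answers with a series of newline-delimited JSON chunks instead of one object, and it's that chunked response the HTTP Request node doesn't deal with nicely:

{
  "model": "llama3.2-fast",
  "prompt": "Summarize this paragraph: ...",
  "stream": true
}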

Connection Pooling in n8n

I tried enabling persistent connections in n8n's HTTP Request node settings. It had no noticeable effect. Ollama doesn't seem to benefit much from connection reuse when requests are CPU-bound, not network-bound.

How I Monitor It Now

I set up a simple monitoring workflow in n8n that pings Ollama every 5 minutes with a lightweight request:

{
  "model": "llama3.2-fast",
  "prompt": "Test",
  "stream": false
}

It logs response time to a Google Sheet. If response time exceeds 10 seconds, I get a notification. This gives me early warning before workflows start failing.
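
The same check is easy to reproduce with plain curl if you ever need to rule out n8n itself; the -w flag prints how long the request took (host is a placeholder):

curl -s -o /dev/null -w "%{time_total}s\n" \
  http://<ollama-vm-ip>:11434/api/generate \
  -d '{"model": "llama3.2-fast", "prompt": "Test", "stream": false}'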

I also check Ollama's logs occasionally, looking for errors or warnings about connection limits:

journalctl -u ollama -f

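When tailing the whole log is too noisy, filtering the last hour for problems works too:

journalctl -u ollama --since "1 hour ago" | grep -iE "error|warn"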

Key Takeaways

  • Ollama's default concurrency settings are conservative. Adjust OLLAMA_NUM_PARALLEL based on your hardware, but don't overdo it.
  • Rate limiting isn't always about network throttling. In my case, it was CPU saturation.
  • Queuing workflows in n8n is more reliable than letting them all fire at once.
  • Smaller context windows reduce memory and CPU load per request.
  • Retry logic in n8n helps handle transient failures without manual intervention.
  • Monitoring response times gives you early warning before things break completely.

Current Limitations

This setup works for my current load (5-10 workflows per hour). If I need to scale beyond that, I'll either need to add GPU passthrough to the VM or run multiple Ollama instances behind a load balancer. Neither is trivial in my Proxmox environment.
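
For what it's worth, the load balancer option would look something like an nginx upstream spread across two Ollama VMs. This is only a sketch of the idea, not something I run, and the IPs are placeholders:

upstream ollama_backends {
    least_conn;
    server 192.168.1.50:11434;
    server 192.168.1.51:11434;
}

server {
    listen 11434;
    location / {
        proxy_pass http://ollama_backends;
        proxy_read_timeout 300s;
    }
}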

I also haven't tested this with larger models like llama3:70b. Those would require more RAM and likely hit different bottlenecks.

The monitoring workflow only checks basic availability. It doesn't track per-model performance or detect gradual degradation. That would require more sophisticated logging, which I haven't built yet.