
Debugging Ollama API rate limiting when running multiple concurrent inference requests from n8n automation workflows

Why I Worked on This

I run multiple n8n workflows that call Ollama for various tasks — summarizing documents, extracting structured data, and generating reports. Everything worked fine when I tested workflows individually. But the moment I started running three or four workflows concurrently, Ollama would choke. Requests timed out, some returned empty responses, and n8n nodes failed silently.

I needed to understand what was actually happening at the API level and fix it without just throwing more hardware at the problem.

My Real Setup

I run Ollama on a Proxmox VM with:

  • 16GB RAM allocated
  • No GPU passthrough (CPU inference only)
  • Models: llama3.2:3b and mistral:7b
  • n8n running in a separate Docker container on the same Proxmox host

The workflows call Ollama's /api/generate endpoint directly using n8n's HTTP Request node. No fancy libraries, no middleware — just plain HTTP POST requests with JSON payloads.
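
For reference, each of those calls boils down to a single POST like this (host and prompt are placeholders, not my actual values):

curl http://<ollama-vm-ip>:11434/api/generate \
  -d '{
    "model": "llama3.2:3b",
    "prompt": "Summarize the following document: ...",
    "stream": false
  }'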

What I Observed

Initial Symptoms

When I triggered multiple workflows at once, I saw:

  • HTTP 503 errors from Ollama
  • Some requests hanging for 60+ seconds before timing out
  • Ollama's logs showing "connection reset by peer" messages
  • CPU spiking to 100% on the Ollama VM

I initially thought it was a model loading issue. But even with OLLAMA_KEEP_ALIVE set to keep models in memory, the same failures happened.
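
For context, that variable is set in Ollama's systemd service file; a value of -1 keeps models loaded indefinitely (the value shown is an example, not necessarily what I use):

Environment="OLLAMA_KEEP_ALIVE=-1"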

Actual Bottlenecks

I ran htop on the Ollama VM and watched what happened during concurrent requests. CPU maxed out immediately. RAM usage was fine — around 8GB used. The problem wasn't memory. It was CPU contention and Ollama's default concurrency limits.

I checked Ollama's environment variables. By default, Ollama only processes a few requests per model in parallel and queues the rest; once that queue fills up, new requests are rejected. My workflows were sending 5-10 requests simultaneously, overwhelming that limit.
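
If you want to see what the service is actually running with, systemd can print both the unit file (including any drop-ins) and the resolved environment:

systemctl cat ollama
systemctl show ollama --property=Environment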

What Worked

1. Setting OLLAMA_NUM_PARALLEL

I added this to the Ollama systemd service file:

Environment="OLLAMA_NUM_PARALLEL=4"

Then restarted the service:

sudo systemctl daemon-reload
sudo systemctl restart ollama

This allowed Ollama to process up to 4 requests concurrently instead of queuing everything. It didn't eliminate failures, but reduced them significantly.
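
The same change also works as a systemd drop-in instead of an edit to the unit file itself. Running sudo systemctl edit ollama opens an override file, and these two lines go into it:

[Service]
Environment="OLLAMA_NUM_PARALLEL=4"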

2. Limiting n8n Workflow Concurrency

Instead of letting n8n fire off unlimited parallel requests, I added a "Queue" node at the start of each workflow. This forces workflows to wait their turn before calling Ollama.

In n8n, I set:

  • Execution mode: Queue
  • Max concurrent executions: 2

This meant only two workflows could call Ollama at the same time. The rest waited in line. Not elegant, but it worked.
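
For reference, here's roughly how queue mode with a capped worker is wired up when n8n runs in Docker. This is a sketch based on n8n's docs, not my exact compose file; a real setup also needs a shared database and a common N8N_ENCRYPTION_KEY across the containers:

services:
  redis:
    image: redis:7
  n8n:
    image: docker.n8n.io/n8nio/n8n
    environment:
      - EXECUTIONS_MODE=queue
      - QUEUE_BULL_REDIS_HOST=redis
    ports:
      - "5678:5678"
  n8n-worker:
    image: docker.n8n.io/n8nio/n8n
    command: worker --concurrency=2
    environment:
      - EXECUTIONS_MODE=queue
      - QUEUE_BULL_REDIS_HOST=redis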

3. Adding Retry Logic in n8n

I configured the HTTP Request node to retry failed requests:

  • Retry on fail: Yes
  • Max retries: 3
  • Wait between retries: 5 seconds

This handled transient failures when Ollama was briefly overloaded. Some requests succeeded on the second or third attempt.
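
The same idea translates directly to curl if you want to test retries outside n8n; its built-in retry flags cover the 503s and timeouts (host is a placeholder):

curl --retry 3 --retry-delay 5 \
  http://<ollama-vm-ip>:11434/api/generate \
  -d '{"model": "llama3.2:3b", "prompt": "Test", "stream": false}'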

4. Reducing Context Window

I created a custom Modelfile for llama3.2 with a smaller context window:

FROM llama3.2:3b
PARAMETER num_ctx 1024

Then created the model:

ollama create llama3.2-fast -f ./Modelfile

This reduced memory usage per request and sped up inference slightly. Not a huge difference, but noticeable when running multiple workflows.
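
A quick way to confirm the parameter actually took is ollama show, which prints the model's details, including its parameters:

ollama show llama3.2-fast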

What Didn't Work

Setting OLLAMA_NUM_PARALLEL Too High

I tried setting OLLAMA_NUM_PARALLEL=10, thinking more concurrency would help. It made things worse. CPU usage stayed at 100%, and requests took even longer because the VM was thrashing, constantly switching between too many inference tasks.

The sweet spot for my 16GB VM was 3-4 concurrent requests. Beyond that, performance degraded.

Using Streaming Responses

I thought streaming ("stream": true) would reduce load by sending partial responses. It didn't help. n8n doesn't handle streaming well in HTTP Request nodes, and I still needed to wait for the full response anyway.

Streaming might work if I built a custom n8n node or used a different client, but for my use case, it added complexity without benefit.
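
For the record, the only change on the request side is the stream flag; Ollama then answers with a series of newline-delimited JSON chunks instead of one object, and it's that chunked response the HTTP Request node doesn't deal with nicely:

{
  "model": "llama3.2-fast",
  "prompt": "Summarize this paragraph: ...",
  "stream": true
}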

Connection Pooling in n8n

I tried enabling persistent connections in n8n's HTTP Request node settings. It had no noticeable effect. Ollama doesn't seem to benefit much from connection reuse when requests are CPU-bound, not network-bound.

How I Monitor It Now

I set up a simple monitoring workflow in n8n that pings Ollama every 5 minutes with a lightweight request:

{
  "model": "llama3.2-fast",
  "prompt": "Test",
  "stream": false
}

It logs response time to a Google Sheet. If response time exceeds 10 seconds, I get a notification. This gives me early warning before workflows start failing.
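
The same check is easy to reproduce with plain curl if you ever need to rule out n8n itself; the -w flag prints how long the request took (host is a placeholder):

curl -s -o /dev/null -w "%{time_total}s\n" \
  http://<ollama-vm-ip>:11434/api/generate \
  -d '{"model": "llama3.2-fast", "prompt": "Test", "stream": false}'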

I also check Ollama's logs occasionally, looking for errors or warnings about connection limits:

journalctl -u ollama -f

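When tailing the whole log is too noisy, filtering the last hour for problems works too:

journalctl -u ollama --since "1 hour ago" | grep -iE "error|warn"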

Key Takeaways

  • Ollama's default concurrency settings are conservative. Adjust OLLAMA_NUM_PARALLEL based on your hardware, but don't overdo it.
  • Rate limiting isn't always about network throttling. In my case, it was CPU saturation.
  • Queuing workflows in n8n is more reliable than letting them all fire at once.
  • Smaller context windows reduce memory and CPU load per request.
  • Retry logic in n8n helps handle transient failures without manual intervention.
  • Monitoring response times gives you early warning before things break completely.

Current Limitations

This setup works for my current load (5-10 workflows per hour). If I need to scale beyond that, I'll either need to add GPU passthrough to the VM or run multiple Ollama instances behind a load balancer. Neither is trivial in my Proxmox environment.
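
For what it's worth, the load balancer option would look something like an nginx upstream spread across two Ollama VMs. This is only a sketch of the idea, not something I run, and the IPs are placeholders:

upstream ollama_backends {
    least_conn;
    server 192.168.1.50:11434;
    server 192.168.1.51:11434;
}

server {
    listen 11434;
    location / {
        proxy_pass http://ollama_backends;
        proxy_read_timeout 300s;
    }
}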

I also haven't tested this with larger models like llama3:70b. Those would require more RAM and likely hit different bottlenecks.

The monitoring workflow only checks basic availability. It doesn't track per-model performance or detect gradual degradation. That would require more sophisticated logging, which I haven't built yet.