Why I Built Rate Limiting for My Self-Hosted LLM Endpoints
I run several self-hosted LLM endpoints on my Proxmox cluster. Some are Ollama instances, others are LiteLLM proxies sitting in front of different models. I expose a few of these through Nginx Proxy Manager for personal use and occasional sharing with friends who test things.
The problem hit me when I noticed unusual traffic patterns in my logs. Someone had gotten hold of an endpoint URL I’d shared in a private chat and was hammering it with what looked like credential stuffing attempts—hundreds of requests per minute trying different API keys. My LLM container was choking, response times went from seconds to timeouts, and my legitimate requests were stuck in queue.
I needed a way to throttle abusive clients at the edge, before they could burn through compute resources or guess valid tokens. Nginx’s rate limiting modules turned out to be exactly what I needed, but getting them configured properly took some trial and error.
My Setup and Context
I run Nginx as a reverse proxy in a Docker container on one of my Proxmox nodes. Behind it sit multiple LXC containers and VMs running different AI services:
- Ollama instances with Llama models
- A LiteLLM proxy that routes to multiple backends
- A custom Python API wrapper I built for some fine-tuned models
Most of these endpoints require a simple API key in the X-API-Key header. I generate these manually and share them sparingly, but I learned the hard way that even private links eventually leak.
My Nginx configuration lives in a mounted volume so I can edit it directly without rebuilding containers. I use the standard Nginx Docker image, not Nginx Proxy Manager for this particular setup, because I wanted full control over rate limiting rules.
What Worked: Building Layered Rate Limits
Starting with Per-IP Limits
My first attempt was a basic per-IP rate limit. I added this to my nginx.conf:
```nginx
# Shared zone tracking request rates per client IP
limit_req_zone $binary_remote_addr zone=ip_limit:10m rate=10r/s;

server {
    listen 443 ssl;
    server_name llm.example.com;

    location /api/ {
        limit_req zone=ip_limit burst=20 nodelay;
        limit_req_status 429;
        # assumes an "upstream ollama_backend { ... }" block defined elsewhere
        proxy_pass http://ollama_backend;
    }
}
```
This created a shared memory zone called ip_limit that tracks request rates per IP address. The rate=10r/s means 10 requests per second average, with a burst allowance of 20 requests. The nodelay flag was important—it tells Nginx to serve the burst immediately rather than queuing requests, which keeps latency predictable for legitimate users.
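To convince myself the burst math worked the way I expected, I sketched Nginx's leaky-bucket accounting in a few lines of Python. This is a simplified model of the algorithm, not Nginx's actual code: with rate=10r/s, burst=20, and nodelay, a client firing requests back-to-back gets 21 through (one on-rate request plus the 20-slot burst) before the 429s start.

```python
def simulate(timestamps, rate, burst):
    """Simplified model of nginx limit_req with nodelay.

    `excess` tracks how far the client is ahead of the allowed rate;
    it drains at `rate` per second and may not exceed `burst`.
    """
    excess = 0.0
    last = None
    results = []
    for t in timestamps:
        if last is not None:
            excess = max(excess - rate * (t - last), 0.0)
        last = t
        if excess > burst:
            results.append(False)   # rejected with 429
        else:
            results.append(True)    # served immediately (nodelay)
            excess += 1.0
    return results

# 25 requests arriving at the same instant, rate=10r/s, burst=20
decisions = simulate([0.0] * 25, rate=10, burst=20)
print(sum(decisions))  # 21 accepted, 4 rejected
```

Running this against different arrival patterns is a cheap way to sanity-check a burst value before touching the live config.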
This worked for stopping the most obvious abuse. The credential stuffing attempts dropped from hundreds per minute to zero within an hour of deployment. But I noticed a problem: my home office shares a single public IP with my neighbor’s network (we’re both behind the same ISP NAT). When I was testing locally, I’d occasionally hit my own rate limit.
Moving to Per-Token Limits
The better solution was to key rate limits by API token instead of IP address. This way, each legitimate user gets their own budget, and NAT issues disappear. Here’s what I implemented:
```nginx
map $http_x_api_key $limit_key {
    default $binary_remote_addr;
    "~.+"   $http_x_api_key;
}

limit_req_zone $limit_key zone=token_limit:20m rate=60r/m;
```
The map directive checks if an X-API-Key header exists. If it does, that becomes the rate limit key. If not, it falls back to IP address. This meant anonymous requests still got throttled, but authenticated users got their own separate budgets.
I set the rate to 60 requests per minute because my typical LLM queries take 5-10 seconds to complete. This gives me room for parallel requests from a client without being too generous to potential abusers.
Adding Connection Limits for Slow Attacks
Rate limiting alone didn’t solve everything. I discovered someone was holding connections open without sending data—a slow-loris style attack that tied up my Nginx worker processes. I added connection limits:
```nginx
limit_conn_zone $limit_key zone=conn_limit:5m;

location /api/ {
    limit_conn conn_limit 5;
    limit_req zone=token_limit burst=30;
    proxy_pass http://ollama_backend;
}
```
This caps each client (by token or IP) to 5 concurrent connections. For LLM endpoints, this is plenty—most clients only need 1-2 connections at a time. The combination of rate limits and connection limits finally gave me proper protection.
What Didn’t Work
Overly Aggressive Burst Settings
My first attempt set burst=5, thinking I’d be strict. This immediately caused problems. When I’d refresh my testing UI or retry a failed request, I’d hit the limit and get 429 errors. LLM requests aren’t perfectly spaced—sometimes you send a few in quick succession, then nothing for a while. I had to increase the burst to 30 to match real usage patterns.
Forgetting the Dry Run Phase
I initially deployed rate limits without testing them in dry-run mode. Nginx has a limit_req_dry_run directive that logs violations without actually blocking requests. I should have used it first to understand my traffic patterns. Instead, I locked myself out twice during testing and had to SSH into the container to fix the config.
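If I were doing it again, I'd stage any new limit behind dry-run first and watch the error log for a day before enforcing. A sketch of what that looks like (the directive requires Nginx 1.17.1 or later):

```nginx
location /api/ {
    limit_req zone=token_limit burst=30;
    limit_req_dry_run on;   # log would-be rejections, but let everything through
    proxy_pass http://ollama_backend;
}
```

Once the log shows only traffic you'd genuinely want to block, remove the directive (or set it to off) to start enforcing.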
Not Returning Proper 429 Responses
My first version just let Nginx return its default 503 error when rate limits triggered. This confused clients because 503 usually means “service down, retry later.” I needed to return 429 specifically and include a Retry-After header:
```nginx
error_page 429 = @rate_limited;

location @rate_limited {
    default_type application/json;
    add_header Retry-After 60 always;
    return 429 '{"error":"rate_limited","retry_after":60}';
}
```
The always flag on add_header was critical—without it, Nginx won’t add headers to error responses.
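The payoff of a proper 429 is that clients can back off intelligently. Here's a small helper along the lines of what any Python client could use; it handles only the integer-seconds form of Retry-After, which is what my config emits (the HTTP-date form is out of scope for this sketch):

```python
def retry_after_seconds(headers, default=60):
    """Parse a Retry-After header given as integer seconds.

    Returns `default` if the header is missing or not a plain integer.
    """
    value = headers.get("Retry-After") or headers.get("retry-after")
    try:
        return max(int(value), 0)
    except (TypeError, ValueError):
        return default

print(retry_after_seconds({"Retry-After": "60"}))  # 60
print(retry_after_seconds({}))                     # 60 (falls back to default)
```

A caller would sleep for that many seconds on a 429 before retrying, instead of hammering the endpoint and digging itself deeper into the limit.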
Underestimating Memory Requirements
I initially allocated 5MB to my rate limit zones. This seemed like plenty, but I underestimated how many unique keys I’d see. Each IP or token consumes a small amount of memory to track state. When my zones filled up, Nginx started evicting old entries, which meant rate limits became inconsistent. I bumped it to 20MB and haven’t had issues since.
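The sizing math is rough but worth doing up front. Per the Nginx docs, a one-megabyte zone holds roughly 16,000 states for 64-byte entries (IPv4 keys) or 8,000 for 128-byte entries (longer keys such as API tokens). A back-of-envelope estimate, pessimistically treating every key as the larger 128-byte state:

```python
ZONE_MB = 20
STATE_BYTES = 128  # worst case per the nginx docs; IPv4 keys use ~64 bytes

capacity = ZONE_MB * 1024 * 1024 // STATE_BYTES
print(capacity)  # 163840 tracked keys before eviction starts
```

For a handful of tokens plus scanner IPs, 20MB is comfortably oversized, which is exactly what you want: eviction under pressure is what made my limits inconsistent.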
Monitoring and Adjusting
I added logging to track rate limit violations:
```nginx
limit_req_log_level warn;
```
This writes a warning to the Nginx error log every time a request is rejected. I tail these logs with a simple monitoring script that sends me a notification if violations spike. This has helped me catch new attack patterns and adjust limits accordingly.
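The monitoring script itself is nothing fancy. A sketch of the parsing half, which pulls the zone and client out of limit_req's rejection lines; the sample line below matches the format current Nginx emits, but verify the regex against your own error log before relying on it:

```python
import re
from collections import Counter

# Matches nginx limit_req rejection lines in the error log
LIMIT_RE = re.compile(
    r'limiting requests.*?by zone "(?P<zone>[^"]+)", client: (?P<client>\S+?),'
)

def count_violations(log_lines):
    """Count rejected requests per (zone, client) pair."""
    counts = Counter()
    for line in log_lines:
        m = LIMIT_RE.search(line)
        if m:
            counts[(m.group("zone"), m.group("client"))] += 1
    return counts

sample = [
    '2024/01/15 10:22:33 [warn] 31#31: *105 limiting requests, '
    'excess: 30.520 by zone "token_limit", client: 203.0.113.7, '
    'server: llm.example.com, request: "POST /api/chat HTTP/1.1"',
]
print(count_violations(sample))
```

Piping `tail -f` output through something like this and alerting when any single client's count spikes covers most of what I need.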
I also discovered that my legitimate usage varies wildly. During testing sessions, I might send 100+ requests in a few minutes. During normal use, maybe 10 per hour. I ended up creating separate location blocks with different limits for different endpoints—tighter limits on authentication routes, looser limits on model inference routes.
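The per-endpoint split ends up looking roughly like this; the zone names, rates, and paths here are illustrative rather than my exact config:

```nginx
# Tight limit on routes that could be brute-forced
limit_req_zone $limit_key zone=auth_limit:5m rate=5r/m;
# Looser limit on inference, where bursty testing sessions are normal
limit_req_zone $limit_key zone=infer_limit:20m rate=60r/m;

location /api/auth/ {
    limit_req zone=auth_limit burst=5;
    proxy_pass http://ollama_backend;
}

location /api/generate/ {
    limit_req zone=infer_limit burst=30;
    proxy_pass http://ollama_backend;
}
```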
Key Takeaways
Rate limiting at the Nginx layer works well for protecting self-hosted LLM endpoints. The key lessons from my setup:
- Start with per-IP limits, but move to per-token limits as soon as you have authentication
- Use both rate limits (requests per second) and connection limits (concurrent sockets) to cover different attack patterns
- Set burst values based on real usage patterns, not arbitrary strictness
- Always test in dry-run mode first and watch the logs
- Return proper 429 responses with Retry-After headers so clients know what’s happening
- Allocate enough shared memory for your zones—20MB is a safe starting point
The configuration I run now has been stable for several months. I still see occasional rate limit violations in the logs, but they’re from obvious bots and scanners, not legitimate traffic. My LLM endpoints stay responsive, and I haven’t had another incident where abuse degraded service for real users.
One limitation I’m still working through: rate limits don’t prevent token guessing entirely, they just slow it down. If someone has enough patience and rotates IPs, they could still eventually guess a valid token. For that, I’d need proper authentication with short-lived JWTs and token rotation, which is a bigger project I haven’t tackled yet.