Why I Built Rate Limiting for My Self-Hosted LLM Endpoints
I run several self-hosted LLM endpoints on my Proxmox cluster. Some are Ollama instances, others are LiteLLM proxies sitting in front of different models. I expose a few of these through Nginx Proxy Manager for personal use and occasional sharing with friends who test things.
The problem hit me when I noticed unusual traffic patterns in my logs. Someone had gotten hold of an endpoint URL I’d shared in a private chat and was hammering it with what looked like credential stuffing attempts—hundreds of requests per minute trying different API keys. My LLM container was choking, response times went from seconds to timeouts, and my legitimate requests were stuck in queue.
I needed a way to throttle abusive clients at the edge, before they could burn through compute resources or guess valid tokens. Nginx’s rate limiting modules turned out to be exactly what I needed, but getting them configured properly took some trial and error.
My Setup and Context
I run Nginx as a reverse proxy in a Docker container on one of my Proxmox nodes. Behind it sit multiple LXC containers and VMs running different AI services:
- Ollama instances with Llama models
- A LiteLLM proxy that routes to multiple backends
- A custom Python API wrapper I built for some fine-tuned models
Most of these endpoints require a simple API key in the X-API-Key header. I generate these manually and share them sparingly, but I learned the hard way that even private links eventually leak.
My Nginx configuration lives in a mounted volume so I can edit it directly without rebuilding containers. I use the standard Nginx Docker image, not Nginx Proxy Manager for this particular setup, because I wanted full control over rate limiting rules.
What Worked: Building Layered Rate Limits
Starting with Per-IP Limits
My first attempt was a basic per-IP rate limit. I added this to my nginx.conf:
```nginx
# Shared zone tracking request rates per client IP
limit_req_zone $binary_remote_addr zone=ip_limit:10m rate=10r/s;

server {
    listen 443 ssl;
    server_name llm.example.com;

    location /api/ {
        limit_req zone=ip_limit burst=20 nodelay;
        limit_req_status 429;
        # assumes an "upstream ollama_backend { ... }" block defined elsewhere
        proxy_pass http://ollama_backend;
    }
}
```
This created a shared memory zone called ip_limit that tracks request rates per IP address. The rate=10r/s means 10 requests per second average, with a burst allowance of 20 requests. The nodelay flag was important—it tells Nginx to serve the burst immediately rather than queuing requests, which keeps latency predictable for legitimate users.
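To convince myself the burst math worked the way I expected, I sketched Nginx's leaky-bucket accounting in a few lines of Python. This is a simplified model of the algorithm, not Nginx's actual code: with rate=10r/s, burst=20, and nodelay, a client firing requests back-to-back gets 21 through (one on-rate request plus the 20-slot burst) before the 429s start.

```python
def simulate(timestamps, rate, burst):
    """Simplified model of nginx limit_req with nodelay.

    `excess` tracks how far the client is ahead of the allowed rate;
    it drains at `rate` per second and may not exceed `burst`.
    """
    excess = 0.0
    last = None
    results = []
    for t in timestamps:
        if last is not None:
            excess = max(excess - rate * (t - last), 0.0)
        last = t
        if excess > burst:
            results.append(False)   # rejected with 429
        else:
            results.append(True)    # served immediately (nodelay)
            excess += 1.0
    return results

# 25 requests arriving at the same instant, rate=10r/s, burst=20
decisions = simulate([0.0] * 25, rate=10, burst=20)
print(sum(decisions))  # 21 accepted, 4 rejected
```

Running this against different arrival patterns is a cheap way to sanity-check a burst value before touching the live config.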
This worked for stopping the most obvious abuse. The credential stuffing attempts dropped from hundreds per minute to zero within an hour of deployment. But I noticed a problem: my home office shares a single public IP with my neighbor’s network (we’re both behind the same ISP NAT). When I was testing locally, I’d occasionally hit my own rate limit.
Moving to Per-Token Limits
The better solution was to key rate limits by API token instead of IP address. This way, each legitimate user gets their own budget, and NAT issues disappear. Here’s what I implemented:
```nginx
map $http_x_api_key $limit_key {
    default $binary_remote_addr;
    "~.+"   $http_x_api_key;
}

limit_req_zone $limit_key zone=token_limit:20m rate=60r/m;
```
The map directive checks if an X-API-Key header exists. If it does, that becomes the rate limit key. If not, it falls back to IP address. This meant anonymous requests still got throttled, but authenticated users got their own separate budgets.
I set the rate to 60 requests per minute because my typical LLM queries take 5-10 seconds to complete. This gives me room for parallel requests from a client without being too generous to potential abusers.
Adding Connection Limits for Slow Attacks
Rate limiting alone didn’t solve everything. I discovered someone was holding connections open without sending data—a slow-loris style attack that tied up my Nginx worker processes. I added connection limits:
```nginx
limit_conn_zone $limit_key zone=conn_limit:5m;

location /api/ {
    limit_conn conn_limit 5;
    limit_req zone=token_limit burst=30;
    proxy_pass http://ollama_backend;
}
```
This caps each client (by token or IP) to 5 concurrent connections. For LLM endpoints, this is plenty—most clients only need 1-2 connections at a time. The combination of rate limits and connection limits finally gave me proper protection.
What Didn’t Work
Overly Aggressive Burst Settings
My first attempt set burst=5, thinking I’d be strict. This immediately caused problems. When I’d refresh my testing UI or retry a failed request, I’d hit the limit and get 429 errors. LLM requests aren’t perfectly spaced—sometimes you send a few in quick succession, then nothing for a while. I had to increase the burst to 30 to match real usage patterns.
Forgetting the Dry Run Phase
I initially deployed rate limits without testing them in dry-run mode. Nginx has a limit_req_dry_run directive that logs violations without actually blocking requests. I should have used it first to understand my traffic patterns. Instead, I locked myself out twice during testing and had to SSH into the container to fix the config.
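If I were doing it again, I'd stage any new limit behind dry-run first and watch the error log for a day before enforcing. A sketch of what that looks like (the directive requires Nginx 1.17.1 or later):

```nginx
location /api/ {
    limit_req zone=token_limit burst=30;
    limit_req_dry_run on;   # log would-be rejections, but let everything through
    proxy_pass http://ollama_backend;
}
```

Once the log shows only traffic you'd genuinely want to block, remove the directive (or set it to off) to start enforcing.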
Not Returning Proper 429 Responses
My first version just let Nginx return its default 503 error when rate limits triggered. This confused clients because 503 usually means “service down, retry later.” I needed to return 429 specifically and include a Retry-After header:
```nginx
error_page 429 = @rate_limited;

location @rate_limited {
    default_type application/json;
    add_header Retry-After 60 always;
    return 429 '{"error":"rate_limited","retry_after":60}';
}
```
The always flag on add_header was critical—without it, Nginx won’t add headers to error responses.
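The payoff of a proper 429 is that clients can back off intelligently. Here's a small helper along the lines of what any Python client could use; it handles only the integer-seconds form of Retry-After, which is what my config emits (the HTTP-date form is out of scope for this sketch):

```python
def retry_after_seconds(headers, default=60):
    """Parse a Retry-After header given as integer seconds.

    Returns `default` if the header is missing or not a plain integer.
    """
    value = headers.get("Retry-After") or headers.get("retry-after")
    try:
        return max(int(value), 0)
    except (TypeError, ValueError):
        return default

print(retry_after_seconds({"Retry-After": "60"}))  # 60
print(retry_after_seconds({}))                     # 60 (falls back to default)
```

A caller would sleep for that many seconds on a 429 before retrying, instead of hammering the endpoint and digging itself deeper into the limit.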
Underestimating Memory Requirements
I initially allocated 5MB to my rate limit zones. This seemed like plenty, but I underestimated how many unique keys I’d see. Each IP or token consumes a small amount of memory to track state. When my zones filled up, Nginx started evicting old entries, which meant rate limits became inconsistent. I bumped it to 20MB and haven’t had issues since.
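The sizing math is rough but worth doing up front. Per the Nginx docs, a one-megabyte zone holds roughly 16,000 states for 64-byte entries (IPv4 keys) or 8,000 for 128-byte entries (longer keys such as API tokens). A back-of-envelope estimate, pessimistically treating every key as the larger 128-byte state:

```python
ZONE_MB = 20
STATE_BYTES = 128  # worst case per the nginx docs; IPv4 keys use ~64 bytes

capacity = ZONE_MB * 1024 * 1024 // STATE_BYTES
print(capacity)  # 163840 tracked keys before eviction starts
```

For a handful of tokens plus scanner IPs, 20MB is comfortably oversized, which is exactly what you want: eviction under pressure is what made my limits inconsistent.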
Monitoring and Adjusting
I added logging to track rate limit violations:
```nginx
limit_req_log_level warn;
```
This writes a warning to the Nginx error log every time a request is rejected. I tail these logs with a simple monitoring script that sends me a notification if violations spike. This has helped me catch new attack patterns and adjust limits accordingly.
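The monitoring script itself is nothing fancy. A sketch of the parsing half, which pulls the zone and client out of limit_req's rejection lines; the sample line below matches the format current Nginx emits, but verify the regex against your own error log before relying on it:

```python
import re
from collections import Counter

# Matches nginx limit_req rejection lines in the error log
LIMIT_RE = re.compile(
    r'limiting requests.*?by zone "(?P<zone>[^"]+)", client: (?P<client>\S+?),'
)

def count_violations(log_lines):
    """Count rejected requests per (zone, client) pair."""
    counts = Counter()
    for line in log_lines:
        m = LIMIT_RE.search(line)
        if m:
            counts[(m.group("zone"), m.group("client"))] += 1
    return counts

sample = [
    '2024/01/15 10:22:33 [warn] 31#31: *105 limiting requests, '
    'excess: 30.520 by zone "token_limit", client: 203.0.113.7, '
    'server: llm.example.com, request: "POST /api/chat HTTP/1.1"',
]
print(count_violations(sample))
```

Piping `tail -f` output through something like this and alerting when any single client's count spikes covers most of what I need.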
I also discovered that my legitimate usage varies wildly. During testing sessions, I might send 100+ requests in a few minutes. During normal use, maybe 10 per hour. I ended up creating separate location blocks with different limits for different endpoints—tighter limits on authentication routes, looser limits on model inference routes.
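The per-endpoint split ends up looking roughly like this; the zone names, rates, and paths here are illustrative rather than my exact config:

```nginx
# Tight limit on routes that could be brute-forced
limit_req_zone $limit_key zone=auth_limit:5m rate=5r/m;
# Looser limit on inference, where bursty testing sessions are normal
limit_req_zone $limit_key zone=infer_limit:20m rate=60r/m;

location /api/auth/ {
    limit_req zone=auth_limit burst=5;
    proxy_pass http://ollama_backend;
}

location /api/generate/ {
    limit_req zone=infer_limit burst=30;
    proxy_pass http://ollama_backend;
}
```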
Key Takeaways
Rate limiting at the Nginx layer works well for protecting self-hosted LLM endpoints. The key lessons from my setup:
- Start with per-IP limits, but move to per-token limits as soon as you have authentication
- Use both rate limits (requests per second) and connection limits (concurrent sockets) to cover different attack patterns
- Set burst values based on real usage patterns, not arbitrary strictness
- Always test in dry-run mode first and watch the logs
- Return proper 429 responses with Retry-After headers so clients know what’s happening
- Allocate enough shared memory for your zones—20MB is a safe starting point
The configuration I run now has been stable for several months. I still see occasional rate limit violations in the logs, but they’re from obvious bots and scanners, not legitimate traffic. My LLM endpoints stay responsive, and I haven’t had another incident where abuse degraded service for real users.
One limitation I’m still working through: rate limits don’t prevent token guessing entirely, they just slow it down. If someone has enough patience and rotates IPs, they could still eventually guess a valid token. For that, I’d need proper authentication with short-lived JWTs and token rotation, which is a bigger project I haven’t tackled yet.