
Configuring Nginx Rate Limiting to Block AI Scrapers Without Breaking Legitimate Bots

Why I Started Rate-Limiting AI Scrapers

I run several sites on my Proxmox cluster, mostly behind Nginx. Over the past year, I noticed traffic spikes that had nothing to do with real visitors. Looking at logs, I'd see the same IP hitting dozens of URLs per second—tag pages, archives, search endpoints—anything that returns content. The User-Agent strings were often lies. Some claimed to be Googlebot. Others pretended to be Firefox. A few were honest about being scrapers, but most weren't.

My server isn't struggling, but I don't want to subsidize someone's dataset collection either. I also didn't want to block legitimate crawlers like search engines or monitoring tools I actually use. So I needed something that separated aggressive behavior from normal browsing without maintaining endless IP blocklists.

My Setup: Nginx on Proxmox with Docker Containers

I run Nginx as a reverse proxy in front of several Docker containers. Some serve static sites. Others run WordPress or custom apps. All traffic flows through one Nginx instance that handles SSL termination and routing.
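As a rough sketch, one of those server blocks looks something like this (the domain, port, and certificate paths are placeholders, not my real config):

# /etc/nginx/conf.d/example.com.conf -- illustrative names and paths
server {
    listen 443 ssl;
    server_name example.com;

    ssl_certificate     /etc/letsencrypt/live/example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/example.com/privkey.pem;

    # Per-site access log in the default combined format
    access_log /var/log/nginx/example.com.access.log;

    location / {
        # Forward to the Docker container published on the host
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}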

I already had fail2ban running to catch SSH brute-force attempts, so extending it to watch web traffic made sense. My logs use the standard combined format, written to per-site files under /var/log/nginx/.
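For reference, combined is Nginx's built-in default format, so there's nothing to declare; written out, it's the line structure the fail2ban filter further down matches against:

log_format combined '$remote_addr - $remote_user [$time_local] '
                    '"$request" $status $body_bytes_sent '
                    '"$http_referer" "$http_user_agent"';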

What Actually Works: Rate Limiting by IP

I added a rate limit zone in Nginx's http context. This lives in /etc/nginx/conf.d/20-limit-req-zone.conf:

limit_req_zone $binary_remote_addr zone=scraperlimit:10m rate=4r/s;

This tracks requests per IP address. The zone uses 10MB of memory, which handles roughly 100,000 unique IPs—way more than I need. The rate is 4 requests per second. That's enough for normal page loads but too slow for scrapers trying to pull entire sites.

Inside each site's location block, I apply the limit:

limit_req zone=scraperlimit burst=20 nodelay;
limit_req_status 402;

The burst allows 20 requests before the limit kicks in. This handles initial page loads where a browser fetches CSS, JavaScript, and images all at once. After that, the 4r/s average applies.

I use HTTP 402 (Payment Required) as the error code. It's unusual enough that it won't collide with normal errors, and it signals my position clearly: if you want to hammer my server, pay for the bandwidth.
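A quick sanity check is to fire a burst of requests at a page and watch the status codes flip from 200 to 402 once the burst allowance runs out. Something like this (the URL is a placeholder):

# Send 40 rapid requests and print only the status codes
for i in $(seq 1 40); do
    curl -s -o /dev/null -w "%{http_code}\n" https://example.com/
done

With rate=4r/s and burst=20 nodelay, roughly the first twenty or so should come back 200 and the rest 402.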

Connecting to fail2ban

The 402 code feeds into fail2ban. I created a filter at /etc/fail2ban/filter.d/nginx-antiscraper.conf:

[Definition]
retcodes = 418|444|402
failregex = ^<HOST> -.*"(GET|POST|HEAD).*HTTP/\d+\.\d+" (%(retcodes)s) .*$

This watches for 402, plus 418 and 444 from other blocking rules I have. If an IP racks up 10 of these codes within 5 minutes, fail2ban adds a blackhole route for 24 hours. The jail config in /etc/fail2ban/jail.local looks like this:

[nginx-antiscraper]
enabled  = true
filter   = nginx-antiscraper
action   = route[blocktype=blackhole]
logpath  = /var/log/nginx/*.access.log
findtime = 300
maxretry = 10
bantime  = 86400

This setup catches scrapers quickly. Once they hit the rate limit a few times, they're blocked for a day. If they come back, the pattern repeats.
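Two checks help before trusting the jail: dry-run the filter against a real log with fail2ban-regex, and ask fail2ban-client what the jail has actually caught (the log filename is a placeholder):

# Check how many log lines the filter matches
fail2ban-regex /var/log/nginx/example.com.access.log /etc/fail2ban/filter.d/nginx-antiscraper.conf

# Reload fail2ban and inspect the jail's counters and banned IPs
fail2ban-client reload
fail2ban-client status nginx-antiscraper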

What Didn't Work

I initially tried blocking User-Agent strings. That failed immediately because scrapers lie. Some rotate agents. Others spoof Chrome or Safari. Maintaining a list of bad agents is pointless when they change constantly.

I also considered blocking entire cloud IP ranges like Google Cloud's 34.174.0.0/16 block, which shows up a lot in my logs. But that's too blunt. Legitimate services run on those IPs too. I'd rather react to behavior than preemptively block infrastructure.

The nodelay setting in the rate limit is still something I'm testing. Without it, Nginx queues excess requests instead of rejecting them immediately. I wanted instant rejection, but I'm not sure yet if that's better. Some scrapers might back off with queuing. Others just retry faster.
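Side by side, the options look like this. The first queues excess requests and smooths them out to 4r/s; the second (delay=, available since Nginx 1.15.7) is a middle ground I haven't tested, serving part of the burst immediately and delaying the rest; the third is what's in the config above:

limit_req zone=scraperlimit burst=20;           # queue excess requests, smooth to 4r/s
limit_req zone=scraperlimit burst=20 delay=8;   # serve 8 immediately, delay the rest of the burst
limit_req zone=scraperlimit burst=20 nodelay;   # serve the whole burst at once, reject beyond it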

Trade-offs I'm Living With

This approach isn't perfect. If a scraper stays under 4 requests per second, it won't trigger anything. A slow, patient bot could still map my entire site. But those bots aren't my problem—they're not loading the server.

There's also the risk of false positives. If someone on a shared IP (like a corporate network) browses aggressively, they might hit the limit. The 20-request burst helps, but it's not foolproof. So far, I haven't seen complaints, but I'm watching for it.

Fail2ban adds a 24-hour ban after repeated violations. That's long enough to stop most scrapers but short enough that a mistaken block isn't permanent. I could extend it, but I'd rather keep it reversible for now.

Key Takeaways

Rate limiting works because it targets behavior, not identity. Scrapers reveal themselves by how they access content, not by what they claim to be.

Combining Nginx's rate limit with fail2ban creates a two-layer defense. The rate limit slows down aggressive requests. Fail2ban removes persistent offenders entirely.

The 402 error code is arbitrary but useful. It's distinct enough to filter on without catching normal errors, and it makes my stance clear.

This setup requires tuning. The 4r/s rate and 20-request burst work for my sites, but yours might need different values. Check your logs to see how many requests a typical page generates.
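One way to ground those numbers is to count requests per IP per second straight from the access log; in combined format, the first field is the client IP and the fourth field starts with the timestamp (the log path is a placeholder):

# Top (IP, second) pairs by request count
awk '{print $1, substr($4, 2, 20)}' /var/log/nginx/example.com.access.log | sort | uniq -c | sort -rn | head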

Nothing here is foolproof. New scrapers will find ways around it. But this approach handles the current wave of aggressive bots without blocking legitimate traffic or maintaining endless blocklists.