Why I Needed Rate Limiting Across Swarm Nodes
I run several internal APIs on my Docker Swarm cluster—things like custom data processing endpoints, webhook receivers, and automation triggers. These services are exposed through Traefik, which handles routing and TLS termination. For a while, I didn’t have any rate limiting in place because everything was internal and controlled.
That changed when I started exposing a few endpoints externally for n8n workflows and third-party integrations. I quickly realized two problems:
- A misconfigured automation loop hit one endpoint hundreds of times in under a minute.
- Without centralized throttling, each Swarm node handled requests independently, so the effective limit was multiplied by the number of replicas: a nominal 10-requests-per-minute limit enforced separately by three Traefik instances really allows up to 30.
I needed a way to enforce consistent rate limits across all nodes without modifying every service individually. That’s when I combined Traefik’s middleware with Redis as a shared state store.
My Setup: Traefik, Docker Swarm, and Redis
My cluster runs three Swarm manager nodes, each with Traefik deployed as a global service. All incoming traffic hits Traefik first, which then routes requests to the appropriate backend containers.
I already had a Redis instance running for other automation tasks (caching n8n data, storing Cronicle job states). This made it the obvious choice for sharing rate limit counters across Traefik instances.
The key components:
- Traefik 2.10+ (supports rate limiting middleware)
- Redis 7.x (single instance, not clustered)
- Docker Swarm with overlay networking
- Traefik labels on services for dynamic configuration
I didn’t use Traefik’s built-in InFlightReq middleware because it only tracks concurrent connections, not request rates over time. I needed something that could say “10 requests per minute, period” and enforce that globally.
Configuring Traefik Middleware for Rate Limiting
Traefik’s rate limiting middleware works by tracking request counts per client IP (or custom key) within a time window. By default, it stores this data in memory, which doesn’t work in a distributed setup because each Traefik instance has its own memory.
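The bookkeeping behind this is simple: one counter per client per time window, which is exactly why it breaks when each instance keeps its own copy. Here is a rough Python sketch of that fixed-window logic, mimicking the INCR/EXPIRE pattern a shared Redis store enables (the class and names are mine for illustration; a dict stands in for Redis):

```python
import time

class FixedWindowLimiter:
    """Fixed-window rate limiter sketch. A plain dict stands in for
    Redis; in production the count would live in a shared key with
    INCR and a TTL, so every instance sees the same counter."""

    def __init__(self, limit, period_s):
        self.limit = limit
        self.period_s = period_s
        self.store = {}  # client_key -> (count, window_expiry)

    def allow(self, client_key, now=None):
        now = time.time() if now is None else now
        count, expiry = self.store.get(client_key, (0, 0))
        if now >= expiry:
            # Window expired: start a fresh one (Redis handles this via EXPIRE)
            count, expiry = 0, now + self.period_s
        count += 1  # Redis equivalent: INCR ratelimit:<client_key>
        self.store[client_key] = (count, expiry)
        return count <= self.limit

limiter = FixedWindowLimiter(limit=10, period_s=60)
results = [limiter.allow("203.0.113.7", now=0) for _ in range(12)]
# First 10 calls allowed, requests 11 and 12 rejected
```

The point of the sketch is the storage, not the arithmetic: move `self.store` into Redis and every Traefik instance enforces the same counter.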
To make it work across nodes, I configured Traefik to use Redis as the storage backend.
Here’s the middleware definition I added to my Traefik static configuration (traefik.yml):
```yaml
http:
  middlewares:
    api-rate-limit:
      rateLimit:
        average: 10
        period: 1m
        burst: 5
        sourceCriterion:
          requestHeaderName: X-Forwarded-For
```
This configuration:
- Allows an average of 10 requests per minute
- Permits a burst of up to 5 additional requests
- Uses the X-Forwarded-For header to identify clients (important when behind a reverse proxy)
But this alone still stores counters locally. To use Redis, I had to enable the Redis storage plugin.
Connecting Traefik to Redis
Traefik doesn’t natively support Redis for rate limiting out of the box. I had to use the experimental plugin system, which felt brittle at first but has been stable in production for months now.
I added this to my Traefik static config:
```yaml
experimental:
  plugins:
    traefik-redis-ratelimit:
      moduleName: github.com/XciD/traefik-redis-ratelimit
      version: v0.2.0

providers:
  plugin:
    traefik-redis-ratelimit:
      redis:
        endpoints:
          - "redis:6379"
        password: ""
        db: 0
```
Then I updated my middleware definition to use the plugin:
```yaml
http:
  middlewares:
    api-rate-limit:
      plugin:
        traefik-redis-ratelimit:
          average: 10
          period: 60
          burst: 5
          sourceHeader: X-Forwarded-For
```
The Redis service is reachable via the overlay network at redis:6379. I didn’t enable authentication on Redis because it’s only accessible within the Swarm network, but I would absolutely use a password if it were exposed externally.
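For completeness, the Redis side is just another Swarm service pinned to the same overlay network. A minimal sketch of how it might be defined (service and network names mirror my setup; the placement constraint is one reasonable choice, not a requirement):

```yaml
services:
  redis:
    image: redis:7-alpine
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.role == manager
    networks:
      - traefik-public

networks:
  traefik-public:
    external: true
```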
Applying the Middleware to Services
Once the middleware was defined, I applied it to specific services using Traefik labels in my Docker Compose files.
Here’s an example for a custom API service:
```yaml
services:
  custom-api:
    image: my-registry/custom-api:latest
    deploy:
      replicas: 3
      labels:
        - "traefik.enable=true"
        - "traefik.http.routers.custom-api.rule=Host(`api.vipinpg.com`)"
        - "traefik.http.routers.custom-api.entrypoints=websecure"
        - "traefik.http.routers.custom-api.tls.certresolver=letsencrypt"
        - "traefik.http.routers.custom-api.middlewares=api-rate-limit@file"
        - "traefik.http.services.custom-api.loadbalancer.server.port=8080"
    networks:
      - traefik-public
```
The key line is:
```
traefik.http.routers.custom-api.middlewares=api-rate-limit@file
```
This tells Traefik to apply the api-rate-limit middleware (defined in the static config) to this router. The @file suffix indicates it’s loaded from the file provider, not dynamic Docker labels.
What Worked
After deploying this setup, I tested it by hitting the endpoint from multiple machines simultaneously. Redis correctly tracked requests across all Traefik instances, and the rate limit was enforced globally.
I confirmed this by watching Redis keys in real-time:
```shell
redis-cli --scan --pattern "ratelimit:*"
```
Each client IP had a corresponding key with a TTL matching the period window. When the limit was hit, Traefik returned a 429 Too Many Requests response immediately, without forwarding the request to the backend service.
This was exactly what I needed—the backend services never saw the excess traffic, and the rate limit applied consistently regardless of which Swarm node handled the request.
What Didn’t Work
The first version of my config used the default sourceCriterion which relies on the remote IP address. In my setup, all requests appear to come from the Traefik container’s IP because of how Docker networking works. This meant every request looked like it came from the same source, so the rate limit applied globally instead of per-client.
I fixed this by switching to X-Forwarded-For, which Traefik populates with the real client IP. This worked, but only because I control the upstream proxy. If you’re behind Cloudflare or another CDN, you’d need to use their specific header (like CF-Connecting-IP).
Another issue: the plugin doesn’t support Redis Sentinel or cluster mode. My Redis instance is a single container, which is a single point of failure. I haven’t addressed this yet because my Redis uptime has been solid, but it’s a known limitation.
Redis Performance and Key Expiry
I was concerned about Redis becoming a bottleneck, but in practice, it handles the load without issue. Each rate limit check is just a GET plus an INCR, operations Redis completes in microseconds.
The keys automatically expire based on the configured period, so there’s no manual cleanup needed. I monitor Redis memory usage through my Grafana dashboard, and it’s never exceeded 50MB even with dozens of endpoints and hundreds of requests per minute.
One thing I learned: if Redis goes down, Traefik fails open by default, meaning requests pass through without rate limiting. I considered this acceptable for my use case, but you could configure it to fail closed by setting a fallback behavior in the plugin config.
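The fail-open vs. fail-closed choice boils down to what you do when the check itself errors out. A small Python sketch of the two behaviors (names are mine; `RedisDown` stands in for a connection error from a real Redis client):

```python
# Illustrates fail-open vs. fail-closed when the shared store is
# unreachable. Not Traefik's actual code, just the decision shape.
class RedisDown(Exception):
    """Stand-in for a Redis connection error."""

def check_rate_limit(client_key):
    # Placeholder for the real Redis-backed check; here it always
    # raises, simulating a Redis outage.
    raise RedisDown("connection refused")

def should_allow(client_key, fail_open=True):
    try:
        return check_rate_limit(client_key)
    except RedisDown:
        # fail_open=True: traffic passes unthrottled during the outage
        # fail_open=False: everything gets rejected until Redis recovers
        return fail_open
```

Fail-open trades protection for availability; fail-closed does the opposite. Which is right depends on whether the backend or the clients are more fragile.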
Per-Endpoint vs. Global Policies
I started by defining one global rate limit policy and applying it everywhere. This was too restrictive for some endpoints and too lenient for others.
Now I use multiple named policies:
- strict: 5 requests per minute for sensitive endpoints
- standard: 20 requests per minute for most APIs
- relaxed: 100 requests per minute for internal-only services
I apply these using the same Traefik label pattern, just swapping the policy name. This gives me fine-grained control without managing separate Redis instances or complex logic.
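In config terms, each policy is just another named middleware with different numbers. Roughly like this (the middleware names and burst values here are illustrative, not exactly what I run):

```yaml
http:
  middlewares:
    rl-strict:
      plugin:
        traefik-redis-ratelimit:
          average: 5
          period: 60
          burst: 2
    rl-standard:
      plugin:
        traefik-redis-ratelimit:
          average: 20
          period: 60
          burst: 10
    rl-relaxed:
      plugin:
        traefik-redis-ratelimit:
          average: 100
          period: 60
          burst: 20
```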
Handling 429 Responses
By default, Traefik returns a plain 429 Too Many Requests with no body. This was confusing for API consumers, so I created a custom error page middleware.
I added this to my Traefik config:
```yaml
http:
  middlewares:
    rate-limit-error:
      errors:
        status:
          - "429"
        service: error-handler
        query: "/errors/rate-limit.html"
```
Then I deployed a simple Nginx container serving static error pages. The rate-limit.html file includes a JSON response with retry-after information.
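The exact body is up to you. Mine looks roughly like this (the field names are my own convention, not a standard):

```json
{
  "error": "rate_limit_exceeded",
  "message": "Too many requests. Please slow down.",
  "retry_after_seconds": 60
}
```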
This middleware is chained with the rate limiter:
```
traefik.http.routers.custom-api.middlewares=api-rate-limit@file,rate-limit-error@file
```
Now clients get a structured response instead of a bare 429, which makes debugging much easier.
Monitoring and Alerting
I track rate limit hits using Traefik’s Prometheus metrics. The traefik_entrypoint_requests_total metric includes a code label, so I can filter for 429 responses.
I set up a Grafana alert that triggers if any endpoint sees more than 50 rate limit rejections in 5 minutes. This usually indicates either a misconfigured client or an actual attack attempt.
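The alert condition is a single PromQL expression over that metric (the exact label set depends on your Traefik version and entrypoint names, so treat this as a template):

```promql
sum by (entrypoint) (increase(traefik_entrypoint_requests_total{code="429"}[5m])) > 50
```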
I also log all 429 responses to a separate file using Traefik’s access log configuration. This gives me a historical record of which IPs are getting throttled and when.
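Traefik's access log supports status-code filtering in the static config, so only throttled requests land in that file. Something along these lines (the file path is from my setup):

```yaml
accessLog:
  filePath: "/var/log/traefik/rate-limited.log"
  filters:
    statusCodes:
      - "429"
```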
Key Takeaways
- Traefik’s built-in rate limiting only works per-instance unless you add external state storage.
- Redis is simple and fast enough for this use case, but it’s a single point of failure in my setup.
- The plugin system works but feels less stable than native Traefik features—expect to test thoroughly after updates.
- Always use X-Forwarded-For or equivalent when behind a proxy, or you'll rate limit yourself.
- Different endpoints need different limits; don't assume one policy fits all.
- Custom error responses make rate limiting far more user-friendly.
This setup has been running for several months now without issues. It’s not perfect (the Redis dependency bothers me), but it solved the immediate problem of distributed throttling without adding complexity to my backend services.