Configuring Nginx stream module for UDP load balancing across multiple WireGuard endpoints with automatic failover using health checks

Why I Needed This

I run multiple WireGuard endpoints across different VPS providers for redundancy. When one provider has network issues or performs maintenance, I need traffic to automatically shift to healthy endpoints without manual intervention. The challenge was finding a way to load balance UDP traffic (WireGuard uses UDP) and detect when an endpoint goes down.

I initially looked at Nginx's stream module because I already use Nginx for web traffic, and I wanted to keep my infrastructure simple. The question was whether Nginx could handle WireGuard's specific requirements—UDP protocol, stateful connections, and the need for reliable failover.

My Actual Setup

I have three Ubuntu 22.04 VMs in my Proxmox cluster:

  • One VM running Nginx as the load balancer (192.168.1.50)
  • Two VMs each running WireGuard endpoints (192.168.1.51 and 192.168.1.52)
  • All three connected to my internal network with proper DNS entries

Each WireGuard endpoint listens on UDP port 51820. My goal was to present a single entry point that clients could connect to, with automatic failover if one endpoint failed.
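
For context, clients only ever point at the load balancer, never at a backend directly. A minimal sketch of a client config (keys and tunnel addresses are placeholders, not my real values):

[Interface]
PrivateKey = <client-private-key>
Address = 10.8.0.2/24

[Peer]
PublicKey = <server-public-key>
Endpoint = 192.168.1.50:51820
AllowedIPs = 10.8.0.0/24

The Endpoint is the Nginx VM, not 192.168.1.51 or .52. For failover to be transparent, both backends also need to present the same server key pair and know the client's public key; otherwise a client that gets remapped to the other backend can't complete a handshake.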

The TCP vs UDP Problem

Here's where I hit my first real limitation. Nginx's stream module does support UDP load balancing, but it works fundamentally differently than TCP. With TCP, Nginx can detect connection failures through the TCP handshake and connection state. With UDP, there's no connection state—it's just packets flying around.

I configured basic UDP load balancing first to see what would happen:

stream {
    upstream wireguard_backend {
        server 192.168.1.51:51820;
        server 192.168.1.52:51820;
    }
    
    server {
        listen 51820 udp;
        proxy_pass wireguard_backend;
        proxy_timeout 10s;
        proxy_responses 1;
    }
}

This configuration accepted UDP packets on port 51820 and forwarded them using round-robin. But I immediately noticed problems:

  • WireGuard clients couldn't establish stable connections because packets were being split between backends
  • The WireGuard handshake would start with one backend, then subsequent packets would hit the other backend
  • No real health checking—Nginx had no way to know if a WireGuard endpoint was actually functioning
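
One way to see the splitting is to watch, on the load balancer, which backend each forwarded packet is sent to; with round-robin the destinations alternate even for a single client (the interface and filter below are illustrative):

# on the Nginx VM
tcpdump -ni any 'udp and port 51820 and (dst host 192.168.1.51 or dst host 192.168.1.52)'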

Hash-Based Routing Solution

The fix for connection stability was using hash-based load balancing instead of round-robin. This ensures packets from the same source IP always hit the same backend:

stream {
    upstream wireguard_backend {
        hash $remote_addr consistent;
        server 192.168.1.51:51820;
        server 192.168.1.52:51820;
    }
    
    server {
        listen 51820 udp;
        proxy_pass wireguard_backend;
        proxy_timeout 10s;
        proxy_responses 1;
    }
}

This worked much better. Each client's traffic now consistently hit the same backend, allowing WireGuard handshakes to complete properly. But I still had no real failover.

The Health Check Challenge

Nginx open source doesn't have active health checks for UDP streams. The max_fails and fail_timeout parameters work for TCP because Nginx can detect connection failures, but with UDP, there's no connection to fail.

I tried adding these parameters anyway:

upstream wireguard_backend {
    hash $remote_addr consistent;
    server 192.168.1.51:51820 max_fails=3 fail_timeout=30s;
    server 192.168.1.52:51820 max_fails=3 fail_timeout=30s;
}

But they didn't trigger failover when I shut down one WireGuard endpoint. Nginx kept sending packets to the dead endpoint because from Nginx's perspective, the UDP socket was still accepting packets—there was just nothing responding on the other end.

External Health Monitoring Approach

Since Nginx couldn't detect WireGuard endpoint health on its own, I built external monitoring. I wrote a simple Python script that runs on the Nginx VM:

#!/usr/bin/env python3
import subprocess
import time

ENDPOINTS = [
    "192.168.1.51:51820",
    "192.168.1.52:51820"
]

def check_wireguard_endpoint(host, port):
    # Probe with netcat in UDP zero-I/O mode. The check only fails when the
    # packet can't be delivered at all (e.g. a dead VM on the same LAN) or an
    # ICMP port-unreachable comes back. Crude, but it catches completely dead
    # endpoints; it can't tell whether WireGuard on a live host is healthy.
    cmd = f"timeout 2 nc -u -z {host} {port}"
    result = subprocess.run(cmd, shell=True, capture_output=True)
    return result.returncode == 0

def update_nginx_config(healthy_endpoints):
    # Rebuild the entire stream block with only the healthy backends.
    config = "stream {\n"
    config += "    upstream wireguard_backend {\n"
    config += "        hash $remote_addr consistent;\n"

    for endpoint in healthy_endpoints:
        config += f"        server {endpoint};\n"

    config += "    }\n"
    config += "    server {\n"
    config += "        listen 51820 udp;\n"
    config += "        proxy_pass wireguard_backend;\n"
    config += "        proxy_timeout 10s;\n"
    config += "        proxy_responses 1;\n"
    config += "    }\n"
    config += "}\n"

    with open("/etc/nginx/stream.conf", "w") as f:
        f.write(config)

    subprocess.run(["nginx", "-s", "reload"])

last_healthy = None

while True:
    healthy = []
    for endpoint in ENDPOINTS:
        host, port = endpoint.split(":")
        if check_wireguard_endpoint(host, int(port)):
            healthy.append(endpoint)

    # Only rewrite the config and reload Nginx when the healthy set actually
    # changes. If every check fails, keep the last known config instead of
    # emptying the upstream block.
    if healthy and healthy != last_healthy:
        update_nginx_config(healthy)
        last_healthy = healthy

    time.sleep(10)

This script checks each endpoint every 10 seconds and, whenever the set of healthy backends changes, regenerates the Nginx stream configuration with only the healthy ones and reloads Nginx. It's not elegant, but it works.
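
One detail the script glosses over: a stream block can't live inside the http context, so the generated file has to be pulled in from the top level of nginx.conf, along these lines:

# /etc/nginx/nginx.conf, at the top level alongside the events and http blocks
include /etc/nginx/stream.conf;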

What Actually Worked

The combination of hash-based routing and external health monitoring gave me functional failover. When I shut down one WireGuard endpoint, the script detected it within 10 seconds and removed it from the upstream configuration. Existing connections stayed pinned to their original backend (if still healthy), and new connections only went to working endpoints.
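
A quick way to confirm this from the client side is to check that the latest handshake timestamp keeps updating after one endpoint is killed (wg0 is just the assumed interface name; WireGuard re-handshakes roughly every two minutes, so give it a few minutes):

# on a connected client
watch -n 5 wg show wg0 latest-handshakes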

The consistent parameter in the hash directive was important. Without it, removing a backend would cause all clients to be remapped, breaking active WireGuard sessions. With consistent hashing, only clients that were mapped to the failed backend got remapped.

What Didn't Work

Several things I tried were dead ends:

  • Relying on Nginx's built-in health detection: Simply doesn't work for UDP without Nginx Plus
  • Using proxy_responses for health checks: This parameter tells Nginx how many UDP responses to expect per client datagram, but it doesn't cause a backend to be marked as down
  • Short proxy_timeout values: I tried setting proxy_timeout 1s hoping it would detect dead endpoints faster, but it just caused packet loss for legitimate slow responses
  • Multiple upstream blocks with different priorities: Nginx stream doesn't support backup servers for UDP like it does for TCP

Real Limitations I Found

This setup works for my use case, but it has clear limitations:

  • The health check script is external and adds complexity. If the script crashes, I lose failover (running it under a process supervisor, sketched after this list, reduces but doesn't remove that risk).
  • There's a 10-second window where traffic might still go to a dead endpoint after it fails.
  • Nginx config reloads can cause brief packet loss during the reload window.
  • This doesn't handle partial failures well—if an endpoint is slow but not dead, Nginx has no way to prefer the faster one.
  • WireGuard's own roaming feature (switching endpoints when it detects better paths) conflicts with the load balancer's sticky routing.
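
A minimal systemd unit is how I'd keep the monitor script running despite crashes; the paths and names here are assumptions, not my exact setup:

[Unit]
Description=WireGuard backend health monitor for Nginx
After=network-online.target nginx.service

[Service]
ExecStart=/usr/bin/python3 /usr/local/bin/wg-healthcheck.py
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target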

Key Takeaways

Nginx's stream module can load balance WireGuard traffic, but it's not a turnkey solution. You need hash-based routing to maintain connection stability and external monitoring for failover. The approach works, but it's more of a workaround than a proper solution.

If I were starting over, I'd seriously consider handling failover on the clients instead. WireGuard itself only takes a single Endpoint per peer, but a small client-side script or a purpose-built tool can switch that endpoint when the current one stops answering handshakes. That removes the need for a load balancer entirely, though it pushes the complexity to client configuration.

For my specific situation—where I want centralized control and don't want to update every client when I add or remove endpoints—the Nginx approach works. But it's definitely more brittle than I'd like, and I'm keeping an eye on other solutions like wgsd (WireGuard Service Discovery) that might handle this more elegantly.

The biggest lesson: UDP load balancing is fundamentally harder than TCP load balancing. Tools designed for TCP often have UDP support bolted on, and it shows. If your protocol is UDP-based and requires stateful connections, test thoroughly before assuming your load balancer will handle it gracefully.