Why I Worked on This
I run a three-node Proxmox cluster at home. Each node hosts Docker containers, and I needed them to talk to each other across physical machines without building a full Kubernetes setup. Docker Swarm's overlay networking seemed like the obvious choice—until packets started disappearing.
The symptoms were frustrating. Containers on the same host worked fine. Containers on different hosts would connect sometimes, then drop packets randomly. Pings would succeed, then fail. HTTP requests would hang mid-transfer. It wasn't consistent enough to blame a single misconfiguration, but it was consistent enough to break my self-hosted services.
I needed to understand what was actually happening at the network layer, not just apply generic fixes from forums.
My Real Setup
Three Proxmox nodes connected through a Netgear managed switch. Each node runs Docker in swarm mode. I created an overlay network called apps for services that need to span hosts—things like n8n, Cronicle, PostgreSQL replicas, and monitoring tools.
The overlay network uses VXLAN encapsulation. Docker handles this automatically when you initialize a swarm and create an overlay network. The encapsulated packets travel over the physical network between nodes, and Docker strips the VXLAN headers on arrival.
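For orientation, the moving parts come down to a few commands. Treat this as a sketch rather than my exact history; the address is a placeholder and the join token comes from the output of swarm init:

# On the first node: start the swarm
docker swarm init --advertise-addr 192.168.1.11

# On the other nodes: join using the token swarm init printed
docker swarm join --token <token> 192.168.1.11:2377

# On any manager: the overlay network the services share
docker network create --driver overlay apps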
My initial configuration:
- Docker version 24.0.7 on all nodes
- MTU set to 1500 on physical interfaces
- No custom iptables rules beyond what Docker creates
- VXLAN traffic not explicitly allowed in Proxmox firewall
That last point turned out to matter.
What Didn't Work
I started by checking the obvious things. DNS resolution worked—containers could resolve each other's names through Docker's embedded DNS. IP connectivity worked too, at least initially. Running docker exec container_name ping other_container would succeed for a few packets, then start timing out.
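In command form, those first checks look like this; getent is just one way to exercise the embedded DNS, and not every image ships it:

docker exec container_name getent hosts other_container
docker exec container_name ping -c 5 other_container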
I checked MTU settings because VXLAN adds 50 bytes of overhead. If your physical network uses MTU 1500, the overlay network needs MTU 1450 to avoid fragmentation. I set this explicitly:
docker network create --driver overlay --opt com.docker.network.driver.mtu=1450 apps
Packet loss continued.
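It's worth confirming the option actually landed before digging deeper. Two quick checks, assuming eth0 is the container's overlay-attached interface:

docker network inspect apps -f '{{json .Options}}'
docker exec container_name ip link show eth0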
I ran tcpdump on both the physical interface and inside containers. On the physical interface, I could see VXLAN packets going out but not always coming back. Inside containers, I saw connection attempts that never completed. The traffic was leaving the source node but not reaching the destination.
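The captures looked roughly like this. The container-side one goes through nsenter because most images don't ship tcpdump; interface names will differ on your hardware:

# On the host: raw VXLAN traffic between nodes
tcpdump -ni eth0 udp port 4789

# Inside the container's network namespace, without installing anything in the image
nsenter -t "$(docker inspect -f '{{.State.Pid}}' container_name)" -n tcpdump -ni eth0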
I suspected the switch. I checked for storm control settings, broadcast limits, anything that might throttle VXLAN's UDP traffic on port 4789. Nothing obvious. I even tried connecting two nodes directly with a crossover cable to eliminate the switch entirely. Same problem.
What Actually Worked
The issue was firewall rules at multiple layers, plus one Docker configuration I didn't expect to matter.
Proxmox Firewall
Proxmox has its own firewall that sits between the physical network and VMs or containers. By default, it blocks traffic that doesn't match explicit rules. VXLAN uses UDP port 4789, and I hadn't allowed it.
I added a rule at the datacenter level of the Proxmox firewall so every node accepts VXLAN from its peers:
- Direction: In
- Action: Accept
- Protocol: UDP
- Dest Port: 4789
- Source: [IP range of other nodes]
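In config form, the datacenter rules live in /etc/pve/firewall/cluster.fw, and the entry looks roughly like this (the source range is an example; use whatever subnet your nodes sit in):

[RULES]

IN ACCEPT -p udp -dport 4789 -source 192.168.1.0/24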
This immediately reduced packet loss but didn't eliminate it.
Docker Swarm Encryption
Docker overlay networks can use IPsec encryption. I had enabled this thinking it would add security without cost. It did add cost—specifically, CPU overhead on every packet. When multiple containers on different nodes tried to communicate simultaneously, the encryption processing couldn't keep up.
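For context, it's a single flag at creation time; my original network had been created with something like:

docker network create --driver overlay --opt encrypted --attachable apps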
I recreated the overlay network without encryption:
docker network create --driver overlay --attachable apps
Packet loss dropped to near zero. I'm not routing this traffic over the internet, so encryption between nodes in my own rack doesn't add meaningful security anyway.
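If you're not sure which mode an existing network is in, the options map in the inspect output tells you; an encrypted overlay lists an encrypted key there, though the exact value varies by Docker version:

docker network inspect apps -f '{{json .Options}}'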
Daemon-Level MTU
Even with MTU set to 1450 on the overlay network, some applications weren't respecting it. TCP connections would establish, then stall when trying to send larger packets. The fix was to set the MTU at the Docker daemon level as well, so networks created with daemon defaults use 1450 rather than the physical 1500.
In /etc/docker/daemon.json on each node:
{
  "mtu": 1450,
  "ip-forward": true,
  "ip-masq": true
}
After restarting Docker, fragmentation issues stopped.
Monitoring What Actually Happens
I set up persistent monitoring to catch future problems early. On each node, I run a simple script in a cron job that pings containers on other nodes and logs failures:
#!/bin/bash
# Ping known container IPs on the other nodes and log anything unreachable.
TARGETS=("node2_container_ip" "node3_container_ip")

for target in "${TARGETS[@]}"; do
  if ! ping -c 3 -W 2 "$target" > /dev/null 2>&1; then
    echo "$(date): Failed to reach $target" >> /var/log/overlay-health.log
  fi
done
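Scheduling it is a one-line crontab entry; the path is just where I drop the script, so adjust to taste:

*/5 * * * * /usr/local/bin/overlay-health.sh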
This runs every five minutes. When packet loss returns, I see it in the logs before users notice.
How I Debug This Now
When overlay networking breaks, I follow this sequence:
1. Verify VXLAN traffic reaches the destination node
On the destination node, capture VXLAN packets:
tcpdump -i eth0 -n udp port 4789
If you see nothing when a container on another node tries to connect, the problem is between nodes—firewall, routing, or physical network.
2. Check inside the overlay network
Exec into a container and inspect its network interface:
docker exec container_name ip addr show eth0
docker exec container_name ip route
Verify the MTU is 1450 and the default gateway points to Docker's overlay network gateway (usually ends in .1).
3. Test with minimal load
Stop all services except two test containers on different nodes. If they communicate perfectly under no load but fail under normal load, you're hitting resource limits—CPU, network bandwidth, or connection tracking table size.
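For the connection-tracking angle specifically, comparing the live count against the table limit tells you quickly whether you're near the ceiling. These are the standard netfilter sysctls, assuming the conntrack module is loaded:

sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max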
4. Check Docker daemon logs
On each node:
journalctl -u docker.service -f
Look for errors about network driver failures, VXLAN setup issues, or swarm gossip protocol problems.
Key Takeaways
Overlay networks work well once configured correctly, but they add complexity that isn't obvious from documentation. VXLAN encapsulation means every layer of your network stack—physical NICs, switches, firewalls, Docker daemon—needs to handle the extra overhead.
The biggest lesson: when debugging distributed systems, test at each layer independently. I wasted time assuming the problem was in Docker when it was actually Proxmox's firewall blocking UDP 4789. I assumed encryption was free when it wasn't.
For homelab use, I now run overlay networks without encryption and with MTU set to 1450. I monitor packet loss actively rather than waiting for services to break. And I keep a test container on each node that does nothing but ping the others, so I know immediately when cross-node communication degrades.
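The probe container is nothing fancy. A global service pinned to the overlay puts one instance on every node; the image and name here are just what I'd pick:

docker service create --name net-probe --mode global --network apps alpine tail -f /dev/null

Swarm's DNS resolves tasks.net-probe to the IP of every task, which gives the health script its targets without hardcoding addresses.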
If you're running Docker across multiple physical hosts, expect to spend time tuning the network layer. The defaults aren't wrong, but they're optimized for cloud environments with different constraints than a homelab cluster.