Why I Worked on This
I run a multi-node Docker Swarm cluster at home across three Proxmox hosts. One day, a distributed application I was testing started dropping packets randomly. Pings would succeed, then fail, then succeed again. Services would time out intermittently. The logs showed nothing obvious, and CPU and memory usage were normal.
After ruling out hardware issues and DNS problems, I suspected the overlay network. What I found was an MTU mismatch between my physical network interfaces and the Docker overlay network's VXLAN encapsulation. This wasn't immediately obvious because most traffic worked fine—only certain packet sizes triggered the issue.
My Real Setup
I have three Proxmox nodes running Docker in Swarm mode:
- Node 1: Dell OptiPlex 7050 (manager node)
- Node 2: HP EliteDesk 800 G3 (worker)
- Node 3: Lenovo ThinkCentre M720q (worker)
All nodes connect through a Ubiquiti EdgeRouter X with gigabit ethernet. The physical interfaces use the default MTU of 1500 bytes. Docker Swarm creates overlay networks using VXLAN, which adds 50 bytes of overhead for encapsulation.
I was running a distributed Redis cluster and a custom Python application that processed data between nodes. Both used an overlay network I created called app-net.
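For orientation, the swarm layout and the overlay networks involved can be confirmed quickly (output naturally differs per cluster):

docker node ls                              # one manager, two workers in my case
docker network ls --filter driver=overlay   # app-net plus the built-in ingress network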
How I Found the Problem
The first clue was inconsistency. Small requests worked perfectly. Larger payloads sometimes succeeded, sometimes failed. I started by checking basic connectivity:
docker exec -it redis-node-1 ping redis-node-2
Ping worked, but with occasional packet loss—around 5-10%. That ruled out complete network failure but pointed to something specific about packet handling.
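A longer ping run makes the loss rate concrete; this is a minimal sketch using the same container names:

# 200 probes at 5 per second; the summary line reports the packet loss percentage
docker exec -it redis-node-1 ping -c 200 -i 0.2 -q redis-node-2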
I checked the overlay network details:
docker network inspect app-net
The network existed and showed all containers properly attached. No errors in the output. I moved to the host level and ran tcpdump on the physical interface while generating traffic:
sudo tcpdump -i enp0s31f6 -n host 192.168.1.12
I saw fragmented packets and retransmissions. That was the smoking gun. Fragmentation meant packets were too large for the network path, and retransmissions meant they were being dropped somewhere.
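If you want to narrow the capture, a few filters are useful (the interface name is from my setup). VXLAN rides UDP port 4789, IPv4 fragments have a non-zero fragment field, and ICMP type 3 code 4 is a router saying a packet needed fragmentation but the DF bit forbade it:

sudo tcpdump -i enp0s31f6 -n udp port 4789                     # the encapsulated overlay traffic itself
sudo tcpdump -i enp0s31f6 -n '(ip[6:2] & 0x3fff) != 0'         # only IPv4 fragments
sudo tcpdump -i enp0s31f6 -n 'icmp[0] == 3 and icmp[1] == 4'   # ICMP "fragmentation needed"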
Understanding the MTU Problem
Docker's overlay network uses VXLAN to encapsulate traffic between nodes. VXLAN adds 50 bytes of headers to each packet. If your physical network has an MTU of 1500 bytes, the effective MTU inside the overlay network is 1450 bytes.
When a container sends a 1500-byte packet through the overlay network, Docker tries to encapsulate it, resulting in a 1550-byte packet. This exceeds the physical interface's MTU, causing the packet to be fragmented or dropped, depending on the DF (Don't Fragment) flag.
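The 50 bytes are the standard VXLAN encapsulation overhead; the container's own Ethernet frame is carried whole inside an outer UDP packet:

inner Ethernet header   14 bytes
VXLAN header             8 bytes
outer UDP header         8 bytes
outer IPv4 header       20 bytes
total overhead          50 bytes

1500 (physical MTU) - 50 = 1450 bytes left for the container's IP packet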
I confirmed this by checking the MTU on the overlay network interface inside a container:
docker exec -it redis-node-1 ip link show eth0
The output showed mtu 1450, which was correct. But my application wasn't respecting this—it was trying to send larger packets, likely because it was configured with a default MTU of 1500.
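To compare both sides at a glance, the physical NIC, the docker_gwbridge that Swarm creates on each node, and the container's overlay interface can all be checked with the same pattern (interface and container names are from my setup):

ip link show enp0s31f6 | grep -o 'mtu [0-9]*'                      # physical NIC, expect 1500
ip link show docker_gwbridge | grep -o 'mtu [0-9]*'                # Swarm's per-node gateway bridge
docker exec redis-node-1 ip link show eth0 | grep -o 'mtu [0-9]*'  # overlay side, expect 1450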
What Worked
I fixed this in two steps: adjusting the Docker overlay network MTU and verifying application behavior.
Step 1: Set the Correct MTU on the Overlay Network
I recreated the overlay network with an explicit MTU setting:
docker network rm app-net
docker network create \
  --driver overlay \
  --opt com.docker.network.driver.mtu=1450 \
  app-net
This ensured the overlay network explicitly used 1450 bytes, accounting for VXLAN overhead. I redeployed the services:
docker service update --network-rm app-net --network-add app-net redis-cluster
docker service update --network-rm app-net --network-add app-net python-app
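A quick way to confirm the option stuck is to pull it straight out of the network's driver options with a Go template:

docker network inspect app-net --format '{{ index .Options "com.docker.network.driver.mtu" }}'
# prints 1450 if the option was applied; empty output means it was not set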
Step 2: Verify with iperf3
I deployed two test containers on different nodes and ran iperf3 to measure throughput and packet behavior:
docker run -d --name iperf-server --network app-net networkstatic/iperf3 -s
docker run --rm --network app-net networkstatic/iperf3 -c iperf-server -t 30
After the MTU fix, throughput stabilized and packet loss dropped to zero. Before the fix, I saw 5-10% packet loss and significantly lower throughput.
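TCP tends to hide MTU problems because the MSS is derived from the interface MTU, so a UDP run with datagrams sized around the suspected boundary makes loss explicit. A rough sketch with the same image; note that standalone docker run containers can only join an overlay network created with the --attachable flag:

# -u: UDP, -b: offered bandwidth, -l: datagram payload size; vary -l around the boundary
docker run --rm --network app-net networkstatic/iperf3 -c iperf-server -u -b 200M -l 1400 -t 10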
Step 3: Persistent Configuration
To avoid recreating the network manually every time, I added the MTU setting to my Docker Compose file:
networks:
app-net:
driver: overlay
driver_opts:
com.docker.network.driver.mtu: 1450
This ensured the setting persisted across deployments.
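Deploying is then the usual stack command; the stack name here is arbitrary, and if your Compose version is strict about option types, quoting the MTU value as "1450" may help:

docker stack deploy -c docker-compose.yml myapp
# note: unless the network is marked external, Swarm creates it as <stack>_app-net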
What Didn't Work
Before finding the MTU issue, I tried several things that didn't help:
- Increasing kernel buffer sizes: I adjusted net.core.rmem_max and net.core.wmem_max (roughly the sketch after this list), thinking it was a buffering problem. It made no difference because the issue wasn't buffer exhaustion; it was packet size.
- Switching to host networking: I tested running the Redis cluster with --network host to bypass the overlay network entirely. This worked but broke the distributed setup because services couldn't discover each other across nodes.
- Disabling IPv6: I read somewhere that IPv6 can cause issues with Docker networking. I disabled it on all nodes. No change. The problem was specific to packet size, not IP version.
- Restarting Docker services: I restarted the Docker daemon on all nodes multiple times. This temporarily cleared some state but didn't fix the underlying MTU mismatch.
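For reference, the buffer-size change from the first bullet looked roughly like this; the values are illustrative, and again, it is harmless but irrelevant to packet size:

# raise the maximum socket receive and send buffer sizes (made no difference here)
sudo sysctl -w net.core.rmem_max=16777216
sudo sysctl -w net.core.wmem_max=16777216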
Debugging Tools I Used
These tools were essential for diagnosing the issue:
- tcpdump: Captured packets on the physical interface to see fragmentation and retransmissions.
- iperf3: Measured throughput and packet loss between containers.
- docker network inspect: Verified overlay network configuration and container attachments.
- ip link show: Checked MTU settings on both physical and virtual interfaces.
- ping with specific packet sizes: ping -M do -s 1450 192.168.1.12 tested whether packets of a specific size could traverse the network without fragmentation (see the sizing note below).
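One sizing note on that last command: -s sets only the ICMP payload, and the IPv4 and ICMP headers add another 28 bytes, so the probes that bracket the two MTUs are:

ping -M do -s 1472 192.168.1.12   # 1472 + 28 = 1500, the largest probe the physical network should pass
ping -M do -s 1422 192.168.1.12   # 1422 + 28 = 1450, the size that matters once VXLAN overhead is added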
Key Takeaways
MTU mismatches are subtle. Most traffic works fine, so the problem doesn't show up in basic tests. You only notice it when applications send larger packets or when network conditions change.
Docker's overlay network uses VXLAN, which adds 50 bytes of overhead. If your physical network has an MTU of 1500, your overlay network should use 1450. Always set this explicitly when creating overlay networks.
Packet loss and fragmentation are clear signs of MTU issues. Use tcpdump and iperf3 to confirm before changing configurations.
Host networking bypasses overlay issues but breaks service discovery in multi-node setups. It's not a real solution for distributed applications.
Always test network changes with real traffic, not just pings. Pings use small packets and won't reveal MTU problems.