Debugging Docker DNS Resolution Failures in Multi-Host Overlay Networks
I’ve spent more hours than I’d like to admit chasing down DNS issues in Docker Swarm. When service discovery breaks between nodes in a multi-host overlay network, it’s a uniquely frustrating experience—partly because the symptoms can mimic so many other problems. Here’s what I’ve learned from real-world debugging sessions.
Why I Worked on This
I was setting up a Swarm cluster across three physical nodes to host a mix of services, including a database, an API gateway, and a few microservices. Everything seemed fine at first, but containers on Node B would intermittently fail to resolve the names of services running on Node A. The logs would show errors like `getaddrinfo ENOTFOUND`, yet the same service worked fine when accessed from Node A itself.
My Real Setup
- **Docker Engine:** Version 20.10.x across all nodes (running on Ubuntu 20.04).
- **Swarm:** Three-node setup with a user-defined overlay network (`my_overlay_net`).
- **Services:** A mix of stateful (PostgreSQL) and stateless (Nginx, custom APIs) services.
- **Networking:** The default `ingress` overlay network for the routing mesh, with `my_overlay_net` carrying service-to-service traffic and discovery.
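For context, here is roughly how the network and services were created. Treat it as a reconstruction rather than my exact shell history; `my-api:latest` is a stand-in for our internal image, and the Postgres password is obviously a placeholder:

```bash
# User-defined overlay network; --attachable lets standalone
# containers join it later, which is handy for debugging
docker network create --driver overlay --attachable my_overlay_net

# Representative services on the overlay
docker service create --name postgres --network my_overlay_net \
  --env POSTGRES_PASSWORD=changeme postgres:13
docker service create --name api --network my_overlay_net \
  --replicas 3 my-api:latest
docker service create --name nginx --network my_overlay_net \
  --publish 80:80 nginx:latest
```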
What Worked (and Why)
- **Verified DNS Server Configuration:** Each node's `/etc/resolv.conf` pointed to the same internal DNS server (I use Pi-hole for local DNS). This ensured consistent host-level resolution of external names; service names are resolved separately by Docker's embedded DNS.
- **Inspected Overlay Network DNS:** Service discovery on an overlay network is handled by Docker's embedded DNS server, which listens at `127.0.0.11` inside every container (not on the host). Running `docker network inspect my_overlay_net` confirmed the network had the expected `overlay` driver and `swarm` scope, and `/etc/resolv.conf` inside a container showed `nameserver 127.0.0.11`, as it should.
- **Checked Service Endpoint Modes:** I confirmed that every service used the endpoint mode I intended: `vip` (the default, where the service name resolves to a virtual IP) or `dnsrr` (DNS round-robin, where it resolves directly to task IPs). Getting this wrong changes what the embedded DNS returns, which is critical in multi-host environments.
- **Forced DNS Refresh:** Sometimes stale networking state on a node can cause issues. Restarting the Docker daemon (`systemctl restart docker`) or even the entire node helped in a few cases. Not ideal, but effective for troubleshooting. Concrete commands for all four checks are sketched after this list.
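Here is what those checks look like as commands. This is a sketch, not a recipe: `api` is one of my service names, and the `nslookup` calls assume the container image ships it (busybox-based and most distro images do):

```bash
# 1. Host-level resolver: should be identical on all three nodes
cat /etc/resolv.conf

# 2. Overlay network details: expect "Driver": "overlay", "Scope": "swarm"
docker network inspect my_overlay_net

# 3. Inside a task container, the embedded DNS is always 127.0.0.11
CID=$(docker ps --filter name=api --quiet | head -n 1)
docker exec "$CID" cat /etc/resolv.conf   # expect: nameserver 127.0.0.11
docker exec "$CID" nslookup api           # service VIP (endpoint mode: vip)
docker exec "$CID" nslookup tasks.api     # individual task IPs

# 4. Confirm the endpoint mode: "vip" (default) or "dnsrr"
docker service inspect --format '{{.Spec.EndpointSpec.Mode}}' api
```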
What Didn’t Work
- **Ignoring Host Network Conflicts:** Initially, I assumed the issue was purely Docker-related. However, I later discovered that a misconfigured iptables rule on one node was dropping the inter-node traffic that Swarm's service discovery depends on (the gossip traffic on TCP/UDP 7946 and the VXLAN data plane on UDP 4789). Always check host-level networking too!
- **Over-relying on `nslookup`:** While `nslookup` is useful, it doesn't always reflect how Docker's internal DNS behaves. For example, `nslookup service_name` might work from the host (where my Pi-hole answers the query) but fail inside the container, which queries Docker's embedded DNS instead. Host and container operate in different DNS contexts.
- **Assuming All Nodes Were Equal:** One node in the cluster had an older kernel (4.15.x), which caused issues with overlay (VXLAN) network routing. Upgrading to a 5.x kernel resolved the intermittent connectivity problems. The host-level checks I should have started with are sketched below.
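These are the host-level checks that would have caught both problems early. Again a sketch; `<node-a-ip>` is a placeholder for a peer node's address, and the UDP probe is best-effort:

```bash
# Kernel version: early 4.x kernels had overlay/VXLAN quirks
uname -r

# Swarm needs these ports open between all nodes:
#   TCP 2377      cluster management
#   TCP/UDP 7946  node-to-node gossip (carries service discovery state)
#   UDP 4789      VXLAN overlay data plane
sudo iptables -S | grep -E '2377|7946|4789'

# Reachability probes from one node toward a peer
nc -vz  -w 3 <node-a-ip> 2377
nc -vz  -w 3 <node-a-ip> 7946
nc -vzu -w 3 <node-a-ip> 7946   # UDP check only catches hard REJECTs
```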
Key Takeaways
- **Docker Swarm's DNS is internal-first:** Containers resolve service names through Docker's embedded DNS at `127.0.0.11`, not through the host's resolver. Always verify resolution from inside a container rather than from the host.
- **Host networking matters:** Firewalls, iptables rules, or even outdated kernels can break Swarm's DNS. Don't assume the issue is isolated to Docker.
- **Endpoint modes are not optional details:** A missing or incorrect setting (like `--endpoint-mode`) can silently break service discovery. Double-check your `docker-compose.yml` or your `docker service create` commands.
- **Restarting helps, but isn't a fix:** If a simple restart resolves DNS issues, that's a sign of a deeper problem (e.g., stale DNS state or networking instability). Investigate further.
Debugging DNS in a multi-host Docker Swarm is a mix of understanding Docker’s internal networking and ensuring your host infrastructure isn’t interfering. The key is to methodically check both layers—Docker’s overlay network and the underlying host—without assuming the problem lies in just one place.