Tech Expert & Vibe Coder

With 14+ years of experience, I specialize in self-hosting, AI automation, and Vibe Coding – building applications using AI-powered tools like Google Antigravity, Dyad, and Cline. From homelabs to enterprise solutions.

Debugging Docker DNS Resolution Failures in Multi-Host Overlay Networks: Fixing Service Discovery Between Swarm Nodes

I’ve spent more hours than I’d like to admit chasing down DNS issues in Docker Swarm. When service discovery breaks between nodes in a multi-host overlay network, it’s a uniquely frustrating experience—partly because the symptoms can mimic so many other problems. Here’s what I’ve learned from real-world debugging sessions.

Why I Worked on This

I was setting up a Swarm cluster across three physical nodes to host a mix of services, including a database, API gateway, and a few microservices. Initially, everything seemed fine, but intermittently, containers on Node B couldn’t resolve service names hosted on Node A. The logs would show errors like getaddrinfo ENOTFOUND, but the same service would work fine when accessed from Node A itself.
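
The asymmetry was easy to see by hand. A minimal sketch of that check (the container and service names here are illustrative, not my exact ones):

  # From a container on Node A, attached to the overlay network: resolution works
  docker exec -it <container_on_node_a> getent hosts api
  # 10.0.1.12        api

  # From a container on Node B, same network, same lookup: nothing comes back
  docker exec -it <container_on_node_b> getent hosts api
  # (exits non-zero; the application surfaces this as getaddrinfo ENOTFOUND)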

My Real Setup

  • Docker Engine: Version 20.10.x across all nodes (running on Ubuntu 20.04).
  • Swarm: Three-node setup with an overlay network (my_overlay_net).
  • Services: A mix of stateful (PostgreSQL) and stateless (Nginx, custom APIs) services.
  • Networking: The default ingress network handling the routing mesh for published ports, with my_overlay_net carrying service-to-service traffic and discovery.
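
For context, the overlay network and services were created along these lines (a sketch: my_overlay_net is the real network name, the service names and images are simplified stand-ins):

  # On a manager node: create an attachable overlay network
  docker network create --driver overlay --attachable my_overlay_net

  # Attach services to it at creation time
  docker service create --name postgres --network my_overlay_net postgres:13
  docker service create --name api --network my_overlay_net my-api-image:latest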

What Worked (and Why)

  1. Verified DNS Server Configuration: Each node’s /etc/resolv.conf pointed to the same internal DNS server (I use Pi-hole for local DNS). This ensured consistency in host-level DNS resolution.

  2. Inspected the Overlay Network's DNS: Docker Swarm resolves service names through an embedded DNS server that listens at 127.0.0.11 inside every container attached to the overlay. Running docker network inspect my_overlay_net confirmed the network really was a swarm-scoped overlay and that my services were attached to it, and /etc/resolv.conf inside the containers pointed at 127.0.0.11 rather than at the host's resolver.

  3. Checked Service Endpoint Mode: I confirmed which --endpoint-mode each service was created with: vip (the default, where the service name resolves to a single virtual IP) or dnsrr (DNS round robin, where it resolves to the individual task IPs). Knowing which mode is in play tells you what a name should resolve to, which is critical in multi-host environments (see the command sketch after this list).

  4. Forced DNS Refresh: Sometimes, stale DNS cache on nodes can cause issues. Restarting the Docker service (systemctl restart docker) or even the entire node helped in a few cases. Not ideal, but effective for troubleshooting.
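
The checks above boil down to a handful of commands (a sketch; my_overlay_net is real, the service name api and the container ID are stand-ins):

  # 1. Host-level resolver, checked on every node
  cat /etc/resolv.conf

  # 2. Confirm the network is a swarm-scoped overlay
  docker network inspect my_overlay_net --format '{{.Driver}} / {{.Scope}}'

  #    Inside a container attached to it, the embedded DNS should answer
  docker exec -it <container_id> cat /etc/resolv.conf   # expect: nameserver 127.0.0.11
  docker exec -it <container_id> getent hosts api       # VIP (or task IPs under dnsrr)

  # 3. See which endpoint mode a service was created with
  docker service inspect api --format '{{.Spec.EndpointSpec.Mode}}'   # vip or dnsrr

  # 4. Last resort: force a refresh by restarting the engine on the affected node
  sudo systemctl restart docker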

What Didn’t Work

  • Ignoring Host Network Conflicts: Initially, I assumed the issue was purely Docker-related. However, I later discovered that a misconfigured iptables rule on one node was dropping the inter-node traffic that Swarm's overlay networking depends on, so service discovery records never made it to that node. Always check host-level networking too! (The ports Swarm needs are listed in the sketch after this list.)

  • Over-relying on nslookup: While nslookup is useful, it doesn’t always reflect how Docker’s internal DNS behaves. The host resolves through its own /etc/resolv.conf (Pi-hole, in my case), while containers on the overlay resolve through Docker’s embedded DNS at 127.0.0.11, so nslookup service_name can succeed in one context and fail in the other. Always test from inside a container on the affected network.

  • Assuming All Nodes Were Equal: One node in the cluster was running an older 4.15.x kernel, which caused issues with overlay network routing. Upgrading it to a 5.x kernel resolved the intermittent connectivity problems.
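
For reference, these are the ports Swarm nodes need to reach each other on; with ufw on Ubuntu 20.04 the allow rules look roughly like this (run on every node, and adapt to whatever firewall you actually use):

  sudo ufw allow 2377/tcp   # cluster management traffic (manager nodes)
  sudo ufw allow 7946/tcp   # node-to-node communication, including service discovery gossip
  sudo ufw allow 7946/udp
  sudo ufw allow 4789/udp   # VXLAN data traffic for overlay networks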

Key Takeaways

  1. Docker Swarm’s DNS is internal-first: Containers resolve service names through Docker’s embedded DNS (127.0.0.11) on the overlay network, not through the host’s resolver. Check /etc/resolv.conf inside the container and docker network inspect to confirm the service is attached to the network you expect.

  2. Host networking matters: Firewalls, iptables, or even outdated kernels can break Swarm’s DNS. Don’t assume the issue is isolated to Docker.

  3. Endpoint mode is not optional: vip and dnsrr make the same service name resolve to very different answers, and the wrong one can silently break service discovery assumptions. Double-check your docker-compose.yml or docker service create commands (a short example follows this list).

  4. Restarting helps, but isn’t a fix: If a simple restart resolves DNS issues, it’s a sign of a deeper problem (e.g., DNS cache or networking instability). Investigate further.
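
To make takeaway 3 concrete: the endpoint mode is set at creation time, and vip is what you get if you pass nothing (service and image names are illustrative):

  # Default: the service name resolves to one virtual IP, load-balanced across tasks
  docker service create --name api --network my_overlay_net --endpoint-mode vip my-api-image:latest

  # DNS round robin: the service name resolves to the individual task IPs
  docker service create --name api --network my_overlay_net --endpoint-mode dnsrr my-api-image:latest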

Debugging DNS in a multi-host Docker Swarm is a mix of understanding Docker’s internal networking and ensuring your host infrastructure isn’t interfering. The key is to methodically check both layers—Docker’s overlay network and the underlying host—without assuming the problem lies in just one place.