Building a Pi-hole Failover Cluster with Keepalived and Unbound Recursive DNS for Zero-Downtime Ad Blocking

Why I Built a Pi-hole Failover Cluster

DNS is the one thing in my network that absolutely cannot go down. When it fails, nothing works—not web browsing, not smart home devices, not even SSH into my servers. For years, I ran a single Pi-hole instance on a Raspberry Pi 3. It worked fine until it didn't.

The first real failure happened during a routine apt upgrade. The Pi locked up mid-update, corrupted the filesystem, and took my entire network offline. I had to manually switch every device to Cloudflare DNS while I rebuilt the Pi from scratch. That's when I decided I needed proper failover.

I didn't want two separate Pi-holes that I had to manually configure and keep in sync. I needed a real cluster—one virtual IP address that would automatically switch between nodes if the primary went down, with settings that stayed consistent across both.

My Setup: Two Nodes, One Virtual IP

I run this on two Proxmox LXC containers, both running Debian 12. Each container has a static IP on my management VLAN:

  • Primary node: 10.0.20.50
  • Backup node: 10.0.20.51
  • Virtual IP (VIP): 10.0.20.49

The virtual IP is what all my devices use for DNS. It floats between the two nodes using VRRP (Virtual Router Redundancy Protocol) through keepalived. If the primary node fails, the backup takes over the VIP within seconds.

I chose LXC containers over VMs because they're lighter and faster to provision. Each container gets 1GB RAM and 2 CPU cores, which is more than enough for Pi-hole and Unbound combined.
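
For anyone reproducing this, creating such a container on the Proxmox host looks roughly like the sketch below; the VMID, template name, bridge, VLAN tag, and gateway are illustrative placeholders rather than values copied from my setup.

# Sketch: provision the primary DNS container on the Proxmox host.
# VMID, template, bridge, VLAN tag, and gateway are examples - adjust to your network.
pct create 201 local:vztmpl/debian-12-standard_12.2-1_amd64.tar.zst \
  --hostname pihole-primary \
  --memory 1024 --cores 2 \
  --net0 name=eth0,bridge=vmbr0,tag=20,ip=10.0.20.50/24,gw=10.0.20.1 \
  --unprivileged 1 --onboot 1
pct start 201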

Why Keepalived Instead of Other Solutions

I looked at several failover options before settling on keepalived:

  • Gravity Sync – Only syncs blocklists and settings, doesn't handle IP failover
  • Pacemaker/Corosync – Too complex for a two-node DNS setup
  • Manual DNS switching – Requires me to be available when things break

Keepalived does one thing well: it monitors service health and moves the virtual IP when needed. It's simple, reliable, and doesn't try to do too much.

How Keepalived Actually Works Here

Each node runs a keepalived daemon that:

  1. Checks if Pi-hole's DNS service (port 53) is responding
  2. Exchanges VRRP advertisements with the other node (the node currently holding the VIP sends one every second, and the other listens for them)
  3. Assigns the virtual IP to whichever node has the highest priority and is healthy

The primary node has priority 150, backup has 140. If the primary fails its health check or stops sending advertisements, the backup claims the VIP automatically.
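
Put together, a minimal vrrp_instance block for the primary node looks something like this sketch. The interface name and virtual_router_id are illustrative, and the check_pihole script it tracks is the health check covered later in this post, which has to be defined above this block in keepalived.conf.

vrrp_instance PIHOLE_VIP {
    state MASTER              # BACKUP on the second node
    interface eth0            # adjust to the container's network interface
    virtual_router_id 53      # any value 1-255, must match on both nodes
    priority 150              # 140 on the backup node
    advert_int 1              # send a VRRP advertisement every second
    virtual_ipaddress {
        10.0.20.49/24         # the shared VIP all clients point at
    }
    track_script {
        check_pihole          # fail over if the DNS health check fails
    }
}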

Adding Unbound for Recursive DNS

By default, Pi-hole forwards DNS queries to upstream servers like Cloudflare or Google. I wanted more privacy, so I added Unbound as a recursive resolver on both nodes.

Instead of asking another DNS server for answers, Unbound queries the authoritative nameservers directly. This means:

  • No third party sees all my DNS queries
  • Slightly slower initial lookups (50-100ms more), but cached responses are instant
  • Full control over DNSSEC validation

I configured Pi-hole to use 127.0.0.1#5335 as its only upstream DNS server. Unbound listens on port 5335 and handles all the recursive lookups.
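
A quick way to verify each hop in that chain is to query Unbound and Pi-hole separately from the node itself:

dig @127.0.0.1 -p 5335 example.com +short   # asks Unbound directly (full recursive lookup)
dig @127.0.0.1 example.com +short           # asks Pi-hole, which forwards to Unbound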

Unbound Configuration That Matters

Most of Unbound's defaults are fine, but I changed a few things:

server:
    interface: 127.0.0.1@5335    # listen only on localhost, port 5335
    do-ip4: yes
    do-ip6: no                   # no native IPv6 here, so skip it
    prefetch: yes                # refresh popular cache entries before they expire
    prefetch-key: yes            # fetch DNSKEYs early for DNSSEC validation
    cache-min-ttl: 300           # cache answers for at least five minutes
    serve-expired: yes           # serve stale answers if upstream is unreachable

The prefetch options tell Unbound to refresh cache entries before they expire, which keeps responses fast. serve-expired means if Unbound can't reach an authoritative server, it'll return a stale cached answer rather than failing completely.

I disabled IPv6 because my ISP doesn't provide native IPv6, and enabling it just adds unnecessary timeouts.

Syncing Configuration Between Nodes

Having two Pi-holes is pointless if their blocklists and settings drift apart. I tried Gravity Sync initially, but it felt fragile—lots of shell scripts and cron jobs that broke when Pi-hole updated.

I switched to a simpler approach: Ansible handles initial provisioning, and I use a tool called Orbital Sync to keep the running configuration in sync.

Orbital Sync watches for changes on the primary node and pushes them to the backup every 5 minutes. It syncs:

  • Blocklists and allowlists
  • Local DNS records
  • DHCP settings (though I don't use Pi-hole for DHCP)
  • Group assignments and client settings

It doesn't sync query logs, which is fine—I only care about the primary's logs for monitoring anyway.
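
Orbital Sync itself runs as a small container configured entirely through environment variables. The compose service looks roughly like the sketch below; the variable names are from memory, so check them against the Orbital Sync documentation before copying, and the passwords are the Pi-hole admin passwords.

services:
  orbital-sync:
    image: mattwebbio/orbital-sync:latest
    environment:
      PRIMARY_HOST_BASE_URL: "http://10.0.20.50"
      PRIMARY_HOST_PASSWORD: "changeme"            # Pi-hole admin password on the primary
      SECONDARY_HOSTS_1_BASE_URL: "http://10.0.20.51"
      SECONDARY_HOSTS_1_PASSWORD: "changeme"       # Pi-hole admin password on the backup
      INTERVAL_MINUTES: "5"                        # push changes every 5 minutes
    restart: unless-stopped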

What Actually Broke

The first time I tested failover, it didn't work. The backup node took over the VIP correctly, but DNS queries timed out. Turns out I forgot to configure the firewall on the backup node to allow traffic on port 53.

Proxmox LXC containers don't have iptables by default, but I run a simple nftables ruleset that blocks everything except SSH and DNS. I had only applied it to the primary node during initial setup.
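
A trimmed-down sketch of that ruleset, including the rules the failover setup actually depends on, looks something like this:

# /etc/nftables.conf (simplified sketch, not the full ruleset)
table inet filter {
    chain input {
        type filter hook input priority 0; policy drop;
        ct state established,related accept
        iif "lo" accept
        ip protocol icmp accept
        ip protocol 112 accept             # VRRP, so keepalived advertisements get through
        tcp dport { 22, 53, 80 } accept    # SSH, DNS over TCP, Pi-hole web UI
        udp dport 53 accept                # DNS over UDP
    }
}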

Another issue: Unbound on the backup node wasn't starting automatically after a reboot. The systemd unit file was there, but systemctl enable unbound had never been run. This meant if the backup node rebooted, it would fail health checks even though Pi-hole itself was running.
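
The fix was a single command, plus a check that both services are actually registered to start at boot on both nodes:

systemctl enable --now unbound            # start Unbound and enable it at boot
systemctl is-enabled unbound pihole-FTL   # should print "enabled" twice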

Health Check Tuning

Initially, keepalived checked if port 53 was open, but that's not enough. If Unbound crashes, port 53 is still open (Pi-hole is listening), but queries fail because Pi-hole can't reach its upstream resolver.

I changed the health check to actually query the local Pi-hole:

vrrp_script check_pihole {
    script "/usr/bin/dig @127.0.0.1 google.com +time=2 +tries=1"
    interval 3    # run the check every 3 seconds
    fall 2        # two consecutive failures mark the node unhealthy
    rise 2        # two consecutive successes mark it healthy again
}

Now keepalived only treats a node as healthy when a real query against the local Pi-hole gets an answer back, not just when something is listening on port 53.
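
Before trusting keepalived with that script, it's worth running it by hand on each node and checking the exit status, because the exit status is all keepalived looks at:

/usr/bin/dig @127.0.0.1 google.com +time=2 +tries=1 > /dev/null; echo $?
# 0 = got an answer (healthy); anything else counts as a failed check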

Performance in Practice

With both nodes running, average query time is around 15ms for cached responses and 80ms for recursive lookups. That's slower than using Cloudflare directly (which is usually 10-12ms), but the privacy trade-off is worth it for me.

Failover takes about 3 seconds. When I shut down the primary node, I see a brief spike in query failures, then everything resumes on the backup. Most devices don't even notice because their DNS caches are still valid.
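
An easy way to watch the handover is a query loop against the VIP from another machine while shutting the primary down:

while true; do
  dig @10.0.20.49 example.com +time=1 +tries=1 +short || echo "FAILED at $(date +%T)"
  sleep 1
done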

One thing I didn't expect: query load doesn't split between the nodes. Everything hits the primary unless it's down. This is by design with VRRP—only one node is active at a time. If I wanted true load balancing, I'd need to set up round-robin DNS or use a different architecture entirely.

What I'd Change If I Started Over

I'd skip Ansible for ongoing management. It's great for initial provisioning, but once the cluster is running, I rarely need to re-run the playbook. Most changes happen through the Pi-hole web interface, and Orbital Sync handles propagation.

I'd also consider running this on physical Raspberry Pis instead of LXC containers. The containers work fine, but they're dependent on Proxmox being healthy. If my Proxmox host goes down, both DNS nodes go with it. Separate hardware would be more resilient.

Finally, I'd document my firewall rules better from the start. I wasted hours troubleshooting failover issues that were just blocked ports.

Key Takeaways

  • VRRP failover works, but only if you test it properly—don't assume the backup node is configured correctly
  • Health checks need to verify the entire DNS resolution path, not just port availability
  • Unbound adds latency but removes upstream DNS dependencies—worth it if privacy matters to you
  • Configuration sync is critical; manual changes on one node will cause drift
  • Running two nodes on the same physical host isn't true high availability, but it handles 90% of failure scenarios

This setup has been running for eight months now with zero unplanned downtime. I've rebooted both nodes for updates without any devices losing DNS. For a home network, that's good enough.