Building Multi-ISP Failover with Policy-Based Routing and Health Checks

Why I Built Multi-ISP Failover

I run my home network on dual ISPs because single points of failure drive me crazy. My primary fiber connection is fast and stable most of the time, but when it goes down—whether from construction crews, ISP maintenance, or random outages—I need my network to stay up. My monitoring systems, DNS servers, and remote access all depend on continuous connectivity. The secondary connection is a slower cable line. I don't want to load-balance across both links because the latency and bandwidth characteristics are different enough to cause problems with some services. What I needed was automatic failover: use the primary until it fails, switch to the backup immediately, and switch back when the primary recovers.

My Setup

I'm running this on an OPNsense firewall with two WAN interfaces:
  • WAN1: 500/500 Mbps fiber (primary)
  • WAN2: 100/10 Mbps cable (backup)
Both connections terminate on separate physical interfaces. Each ISP provides a gateway, and I configured static IPs on both WANs. My internal network is a flat /24 subnet behind the firewall. The goal was policy-based routing with health monitoring: all traffic uses WAN1 by default, but if WAN1 fails health checks, traffic should automatically route through WAN2.

Health Check Configuration

OPNsense has a built-in gateway monitoring system called dpinger. It sends ICMP probes to specified targets and tracks packet loss and latency. I configured health checks for both gateways:
  • WAN1 monitor target: 1.1.1.1 (Cloudflare DNS)
  • WAN2 monitor target: 8.8.8.8 (Google DNS)
I set the probe interval to 1 second with a 500ms timeout. The gateway is marked as down after 5 consecutive probe failures, which means roughly 5 seconds to detect an outage. I initially used my ISPs' gateway IPs as monitor targets, but that caused a problem: if the ISP's gateway was up but their upstream routing was broken, health checks passed while actual internet connectivity was dead. Using external public DNS servers solved this—they're reliable targets and test the full path.
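
To make the detection behavior concrete, here's a minimal sketch of the same state machine in Python. This is not dpinger (which probes with ICMP); it opens a TCP connection to port 53 on the monitor target so it can run unprivileged, but the parameters mirror my configuration: 1-second interval, 500 ms timeout, down after 5 consecutive failures.

```python
#!/usr/bin/env python3
"""Sketch of the health-check state machine described above.

Not OPNsense's dpinger (which uses ICMP); this probes with a TCP connect to
port 53 on the monitor target so it runs unprivileged. The parameters mirror
my config: 1 s interval, 500 ms timeout, down after 5 consecutive failures.
"""
import socket
import time

MONITOR_TARGET = "1.1.1.1"   # WAN1 monitor target (Cloudflare DNS)
PROBE_INTERVAL = 1.0         # seconds between probes
PROBE_TIMEOUT = 0.5          # seconds to wait for each probe
FAIL_THRESHOLD = 5           # consecutive failures before declaring the gateway down


def probe(target: str, timeout: float) -> bool:
    """Return True if the target accepts a TCP connection on port 53 in time."""
    try:
        with socket.create_connection((target, 53), timeout=timeout):
            return True
    except OSError:
        return False


def monitor() -> None:
    failures = 0
    gateway_up = True
    while True:
        if probe(MONITOR_TARGET, PROBE_TIMEOUT):
            failures = 0
            if not gateway_up:
                gateway_up = True
                print("monitor target reachable again, failing back to WAN1")
        else:
            failures += 1
            if gateway_up and failures >= FAIL_THRESHOLD:
                gateway_up = False
                print(f"{failures} consecutive probe failures, failing over to WAN2")
        time.sleep(PROBE_INTERVAL)


if __name__ == "__main__":
    monitor()
```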

Policy-Based Routing Rules

OPNsense uses gateway groups for failover. I created a gateway group with:
  • Tier 1: WAN1 (priority 1)
  • Tier 2: WAN2 (priority 2)
The group is set to failover mode, not load-balance. This means all traffic uses the highest-priority available gateway. Then I configured firewall rules on the LAN interface to use this gateway group as the default route. The critical part is making sure the default rule sends traffic through the gateway group, not directly to a single WAN interface.
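
In failover mode the group logic boils down to "send everything via the healthy gateway in the lowest tier." Here's a small sketch of that selection step; the gateway names are placeholders, not the internal representation OPNsense uses.

```python
from dataclasses import dataclass


@dataclass
class Gateway:
    name: str
    tier: int      # lower tier = higher priority
    is_up: bool    # fed by the dpinger health checks


# Tier 1 = WAN1 (fiber), tier 2 = WAN2 (cable), matching the group above.
GATEWAY_GROUP = [
    Gateway("WAN1_GW", tier=1, is_up=True),
    Gateway("WAN2_GW", tier=2, is_up=True),
]


def active_gateway(group: list[Gateway]) -> Gateway | None:
    """Failover mode: all traffic uses the best-tier gateway that is still up."""
    healthy = [gw for gw in group if gw.is_up]
    return min(healthy, key=lambda gw: gw.tier) if healthy else None


print(active_gateway(GATEWAY_GROUP).name)  # WAN1_GW while the fiber link is healthy
GATEWAY_GROUP[0].is_up = False             # dpinger marks WAN1 down
print(active_gateway(GATEWAY_GROUP).name)  # WAN2_GW until WAN1 recovers
```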

Rule Order Matters

I had to be careful with rule ordering. My firewall rules are processed top-to-bottom, and I have some traffic that needs to stay on specific interfaces (like VPN connections bound to WAN1). Those rules sit above the default gateway group rule so they don't get caught by the failover logic.
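
The model is plain first-match-wins, which is why ordering matters. Here's a tiny illustration of that evaluation; the WireGuard port and the rules themselves are made up for the example, not my actual ruleset.

```python
# LAN rules are evaluated top to bottom; the first match decides the gateway.
RULES = [
    # Pin VPN traffic to WAN1 so the failover logic never moves it (example port).
    {"match": lambda pkt: pkt["dport"] == 51820, "gateway": "WAN1_GW"},
    # Default: everything else uses the failover gateway group.
    {"match": lambda pkt: True, "gateway": "FAILOVER_GROUP"},
]


def pick_gateway(pkt: dict) -> str:
    for rule in RULES:
        if rule["match"](pkt):
            return rule["gateway"]  # first match wins, so order the rules carefully
    return "default"


print(pick_gateway({"dport": 51820}))  # WAN1_GW: pinned rule sits above the group rule
print(pick_gateway({"dport": 443}))    # FAILOVER_GROUP
```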

What Worked

Failover happens within 5-10 seconds of WAN1 going down. I tested this by physically unplugging the fiber connection and watching traffic shift to WAN2. My SSH sessions stayed alive, monitoring continued, and DNS kept resolving. When WAN1 comes back up, traffic switches back automatically. There's no manual intervention needed. Using external DNS servers as health check targets proved reliable. I've had zero false positives where the system thought a link was down when it wasn't.

What Didn't Work

Monitoring the ISP Gateway Directly

As mentioned, this was my first attempt. It failed because ISP gateways can be reachable while their upstream is broken. I spent an hour troubleshooting why my internet was "working" according to the firewall but nothing actually loaded.

Too Aggressive Health Check Timing

I initially set the failure threshold to 3 probes with a 1-second interval. This caused flapping during brief network hiccups—traffic would bounce between WANs unnecessarily. Increasing to 5 probes made the system less sensitive to transient packet loss.

Asymmetric Routing Issues

This one was subtle. When traffic failed over to WAN2, some return packets tried to come back through WAN1's interface because of how the ISP routed responses. This broke stateful firewall sessions. I fixed it by enabling reply-to on the firewall rules, which forces return traffic to use the same interface it came in on. Without this, connections would hang during failover.
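
Conceptually, reply-to records the interface a connection arrived on in its state entry and uses that for replies instead of consulting the routing table. Here's a rough illustration of the idea, not pf's actual implementation:

```python
# Sketch of reply-to behaviour: remember which WAN a connection came in on and
# send replies back out that interface, rather than letting the routing table
# pick one (which is what caused my asymmetric-routing hangs).
STATE_TABLE: dict[tuple, str] = {}


def on_inbound(conn: tuple, ingress_if: str) -> None:
    """New inbound connection: record the interface it arrived on."""
    STATE_TABLE[conn] = ingress_if


def reply_interface(conn: tuple, routing_table_choice: str) -> str:
    """With reply-to, replies use the recorded ingress interface."""
    return STATE_TABLE.get(conn, routing_table_choice)


conn = ("203.0.113.10", 51514, "198.51.100.2", 443)  # example 4-tuple
on_inbound(conn, "WAN2")              # session established via WAN2 during failover
print(reply_interface(conn, "WAN1"))  # WAN2: the reply goes back the way it came in
```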

Monitoring and Validation

I use n8n to poll the OPNsense API every minute and log gateway status. This gives me a history of failover events and health check results. I can see exactly when WAN1 went down and how long the failover took. The dashboard shows:
  • Current active gateway
  • Packet loss percentage for each WAN
  • Latency to monitor targets
  • Timestamp of last failover event
This data helped me tune the health check thresholds. I could see that normal packet loss on WAN1 was under 0.5%, so setting the packet loss threshold at 5% gave enough margin to avoid false positives.
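
The same polling can be done with a plain Python script instead of n8n; this sketch shows the general shape. The endpoint path and JSON field names are assumptions on my part, so verify them against the API documentation for your OPNsense version, and generate an API key/secret pair for authentication.

```python
#!/usr/bin/env python3
"""Poll gateway status from the OPNsense API and append it to a CSV log.

The endpoint path and response fields below are assumptions; check your
OPNsense version's API docs. Credentials and firewall address are placeholders.
"""
import csv
import datetime

import requests

OPNSENSE_URL = "https://192.168.1.1"            # firewall address (placeholder)
API_KEY = "your-api-key"                        # placeholder credentials
API_SECRET = "your-api-secret"
STATUS_ENDPOINT = "/api/routes/gateway/status"  # assumed endpoint path


def poll_gateways() -> list[dict]:
    resp = requests.get(
        OPNSENSE_URL + STATUS_ENDPOINT,
        auth=(API_KEY, API_SECRET),  # OPNsense API key/secret via basic auth
        verify=False,                # self-signed certificate on the firewall
        timeout=5,
    )
    resp.raise_for_status()
    return resp.json().get("items", [])


def log_status(path: str = "gateway_log.csv") -> None:
    now = datetime.datetime.now(datetime.timezone.utc).isoformat()
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        for gw in poll_gateways():
            # Field names (name, status, loss, delay) are assumptions too.
            writer.writerow([now, gw.get("name"), gw.get("status"),
                             gw.get("loss"), gw.get("delay")])


if __name__ == "__main__":
    log_status()
```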

Limitations and Trade-offs

This setup works for most traffic, but there are edge cases:
  • Stateful sessions break during failover: Long-running TCP connections (like SSH) will drop when the active WAN changes. There's no way around this without more complex setups like MPTCP, which I haven't explored.
  • IP address changes: Services that rely on my public IP (like dynamic DNS or remote access) need to handle the fact that my IP changes during failover. I run my own DNS and update records automatically (a sketch of one approach follows this list), but this adds complexity.
  • Bandwidth mismatch: WAN2 is significantly slower than WAN1. During failover, everything still works, but large downloads or video streams take a hit. This is acceptable for my use case since failover is temporary.
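
For the IP-change problem, here's one way to automate the record update. My own setup differs in the details; this sketch assumes a DNS server that accepts RFC 2136 dynamic updates (handled with dnspython) and uses api.ipify.org to discover which public IP is currently active. The zone, hostname, server address, and TSIG key are all placeholders.

```python
#!/usr/bin/env python3
"""Update an A record after a failover changes the public IP.

Assumes an authoritative DNS server that accepts RFC 2136 dynamic updates
secured with a TSIG key. Zone, hostname, server, and key are placeholders.
"""
import dns.query
import dns.tsigkeyring
import dns.update
import requests

ZONE = "example.net."
HOSTNAME = "remote"        # record to update: remote.example.net
DNS_SERVER = "192.0.2.53"  # authoritative server (placeholder)
TSIG = dns.tsigkeyring.from_text({"ddns-key.": "c2VjcmV0"})  # placeholder key


def current_public_ip() -> str:
    """Whichever WAN is active carries this request, so it reflects failover."""
    return requests.get("https://api.ipify.org", timeout=5).text.strip()


def update_record(ip: str) -> None:
    update = dns.update.Update(ZONE, keyring=TSIG)
    update.replace(HOSTNAME, 60, "A", ip)  # short TTL so clients re-resolve quickly
    dns.query.tcp(update, DNS_SERVER, timeout=5)


if __name__ == "__main__":
    update_record(current_public_ip())
```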

Key Takeaways

  • Health checks must test the full internet path, not just the ISP gateway.
  • Failover timing needs tuning based on your network's normal behavior. Too sensitive causes flapping; too slow leaves you offline longer.
  • Reply-to rules are critical for avoiding asymmetric routing issues during failover.
  • Monitoring failover events over time reveals patterns and helps validate the system actually works when you need it.
This setup has been running for over a year. I've had three unplanned WAN1 outages in that time, and failover worked correctly in all cases. The system isn't perfect—stateful sessions drop, and bandwidth takes a hit—but it keeps my network operational when the primary link fails, which is exactly what I needed.