Why I Built This
I run my own DNS infrastructure because I don't trust third parties with my routing decisions. Living in a region where international connectivity is unpredictable, I've watched subsea cable cuts take down entire services for hours while DNS kept pointing traffic at dead links.
The problem: my primary ISP routes international traffic through specific subsea cables. When one goes down, latency spikes to 800ms+ or connections simply time out. But my DNS records don't know about this. They keep sending users to servers that are technically up but practically unreachable.
I needed a system that could detect cable degradation in real-time and automatically switch DNS routing to backup paths. Not after I wake up and notice. Not after users complain. Immediately.
My Setup Before This
I was already running n8n for various automation tasks on a Proxmox VM. It handles monitoring for my self-hosted services, scrapes data I need for processing, and triggers alerts. I use Cloudflare for DNS because their API is solid and I can script changes without fighting their interface.
My infrastructure has two internet paths:
- Primary ISP: fiber connection with good international routing when cables are healthy
- Secondary ISP: slower but uses different cable systems
I host some services at home and some on a VPS. The VPS becomes the failover target when home international connectivity degrades.
Finding Subsea Cable Status Data
This was harder than it should have been. Most cable status information is locked behind industry portals or simply not published. I found two sources that actually work:
TeleGeography's API: They maintain a database of major cable systems but don't provide real-time status. I use this to map which cables my ISP likely uses based on their network path.
Manual monitoring of ISP routing: I run traceroutes from multiple points and compare latency patterns. When my usual Singapore hop jumps from 45ms to 600ms, I know something's wrong with the direct cable.
I don't have access to official cable status APIs. The workflow I built doesn't rely on them. Instead, it monitors the actual impact: latency and packet loss to key international endpoints.
The n8n Workflow Structure
I created a workflow that runs every 2 minutes. Here's what it actually does:
Step 1: Health Check Probes
The workflow starts with HTTP Request nodes hitting multiple targets:
- My VPS in Singapore
- Cloudflare's 1.1.1.1
- Google's 8.8.8.8
- A server I know routes through the specific cable I'm monitoring
I'm not measuring if they're up (they always are). I'm measuring response time. Each request has a 5-second timeout. I record both success/failure and latency.
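The post doesn't show the probe internals, so here's a minimal standalone Node.js (18+) sketch of what each probe measures. The URL and the HEAD method are illustrative; the success and responseTime fields match what the later Code node consumes.

// Minimal latency probe sketch (standalone Node.js 18+), not the exact
// n8n HTTP Request node setup; the workflow records the same fields.
async function probe(url, timeoutMs = 5000) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  const started = Date.now();
  try {
    // HEAD keeps the payload small; the round-trip time is what matters.
    await fetch(url, { method: 'HEAD', signal: controller.signal });
    return { target: url, success: true, responseTime: Date.now() - started };
  } catch {
    // Timeouts and connection failures both count as a failed probe.
    return { target: url, success: false, responseTime: null };
  } finally {
    clearTimeout(timer);
  }
}

// Example: one of the four targets; the workflow probes the others the same way.
probe('https://1.1.1.1').then(console.log);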
Step 2: Latency Analysis
This is where most automated monitoring fails. A single slow response doesn't mean the cable is down. Network jitter happens. I needed to detect sustained degradation.
I use an n8n Code node to calculate a rolling average over the last 5 checks. The workflow stores these values in a simple JSON file on the VM. If the 5-check average exceeds my threshold (200ms for Singapore, 150ms for Cloudflare), that's a real problem.
Here's the function code I use in the workflow:
const latencyHistory = $('ReadHistory').first().json.history || [];
const currentLatency = $input.item.json.responseTime;

// Append the newest measurement
latencyHistory.push({
  timestamp: Date.now(),
  latency: currentLatency
});

// Keep only the last 5 checks
if (latencyHistory.length > 5) {
  latencyHistory.shift();
}

// Rolling average across the window
const avgLatency = latencyHistory.reduce((sum, item) =>
  sum + item.latency, 0) / latencyHistory.length;

return {
  json: {
    history: latencyHistory,
    average: avgLatency,
    threshold_exceeded: avgLatency > 200 // Singapore threshold; 150 for the Cloudflare probe
  }
};
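That snippet reads its history from a prior 'ReadHistory' node, but the write-back isn't shown in the post. Here is a minimal sketch of how the updated history could be persisted from a follow-up Code node, assuming the instance allows Node built-ins in Code nodes (e.g. NODE_FUNCTION_ALLOW_BUILTIN=fs); the file path is illustrative.

// Write-back sketch (my assumption of the persistence step): store the
// updated rolling window so the next run's ReadHistory node can load it.
// Requires Node built-ins to be allowed for Code nodes.
const fs = require('fs');

const HISTORY_PATH = '/data/latency-history.json'; // illustrative path

const history = $input.item.json.history; // output of the rolling-average node
fs.writeFileSync(HISTORY_PATH, JSON.stringify(history, null, 2));

return { json: { saved: true, checks: history.length } };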
Step 3: Decision Logic
The workflow branches based on the threshold check. If average latency is normal, it does nothing. If it's exceeded, it checks the current DNS state.
I maintain a state file that tracks:
- Current active DNS target (home or VPS)
- Last failover timestamp
- Number of failovers in the last 24 hours
This prevents flapping. If I've already failed over in the last 10 minutes, the workflow logs the issue but doesn't change DNS again. Cable problems don't resolve in seconds, and I don't want DNS records bouncing back and forth.
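Here is a sketch of that check as it might look in a Code node. The 10-minute lockout follows the post; the state-file path and field names are my own placeholders.

// Flapping-prevention sketch: skip DNS changes if a failover already
// happened within the last 10 minutes. Path and field names are illustrative.
const fs = require('fs');

const STATE_PATH = '/data/dns-state.json';
const LOCKOUT_MS = 10 * 60 * 1000; // 10-minute lockout window

const state = JSON.parse(fs.readFileSync(STATE_PATH, 'utf8'));
const degraded = $input.item.json.threshold_exceeded;
const recentlyFailedOver = Date.now() - state.lastFailoverTimestamp < LOCKOUT_MS;

let action = 'none';
if (degraded && state.activeTarget === 'home' && !recentlyFailedOver) {
  action = 'failover_to_vps'; // continue to the Cloudflare update step
} else if (degraded) {
  action = 'log_only'; // degraded, but within the lockout (or already on VPS)
}

return { json: { action, state } };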
Step 4: DNS Switching
When failover is needed, the workflow calls Cloudflare's API to update my A records. I have this set up for my primary domain and a few subdomains that point to self-hosted services.
The API call looks like this:
PUT https://api.cloudflare.com/client/v4/zones/{zone_id}/dns_records/{record_id}

{
  "type": "A",
  "name": "service.vipinpg.com",
  "content": "VPS_IP_ADDRESS",
  "ttl": 300,
  "proxied": false
}
I use a 300-second TTL specifically for this. Most DNS resolvers will honor it, which means failover takes effect within 5 minutes for most users. Not instant, but acceptable for my use case.
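For reference, here is a hedged standalone Node.js sketch of the same call with Cloudflare's bearer-token auth; in n8n this would typically be an HTTP Request node. The zone ID, record ID, and token are placeholders.

// Standalone Node.js 18+ sketch of the Cloudflare record update; IDs and
// token are placeholders.
async function failoverToVps() {
  const zoneId = 'ZONE_ID';
  const recordId = 'RECORD_ID';
  const apiToken = 'CLOUDFLARE_API_TOKEN'; // scoped DNS-edit token

  const res = await fetch(
    `https://api.cloudflare.com/client/v4/zones/${zoneId}/dns_records/${recordId}`,
    {
      method: 'PUT',
      headers: {
        Authorization: `Bearer ${apiToken}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        type: 'A',
        name: 'service.vipinpg.com',
        content: 'VPS_IP_ADDRESS', // placeholder, as in the body above
        ttl: 300,
        proxied: false,
      }),
    },
  );
  return res.json(); // Cloudflare returns { success, errors, result, ... }
}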
Step 5: Notifications
After DNS changes, the workflow sends me a Telegram message with:
- Which threshold was exceeded
- Current average latency
- Which DNS target is now active
- Timestamp of the change
I also log everything to a local file so I can analyze patterns later.
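A sketch of the alert using Telegram's standard Bot API sendMessage endpoint; the bot token, chat ID, and message fields are placeholders shaped after the list above.

// Standalone sketch of the Telegram alert; token and chat ID are placeholders.
// In n8n this can also be done with a Telegram or HTTP Request node.
async function notify(details) {
  const botToken = 'TELEGRAM_BOT_TOKEN';
  const chatId = 'CHAT_ID';

  const text = [
    `Threshold exceeded: ${details.target} (> ${details.threshold} ms)`,
    `Current average latency: ${details.average} ms`,
    `Active DNS target: ${details.activeTarget}`,
    `Changed at: ${new Date(details.timestamp).toISOString()}`,
  ].join('\n');

  await fetch(`https://api.telegram.org/bot${botToken}/sendMessage`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ chat_id: chatId, text }),
  });
}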
What Worked
The rolling average approach eliminated false positives. In the first version, I triggered on single slow responses and ended up with 15 unnecessary failovers in one week. The 5-check average gives me confidence that something is actually wrong.
Storing state in a simple JSON file works perfectly for this. I don't need a database. The file is small, reads are fast, and I can inspect it manually when debugging.
The flapping prevention is critical. Without it, I'd have DNS records changing every few minutes during unstable periods. The 10-minute lockout means I accept some downtime in exchange for stability.
Cloudflare's API has been reliable. I've made hundreds of automated DNS changes and never hit rate limits or seen API failures.
What Didn't Work
My first attempt used ping instead of HTTP requests. Turns out many networks deprioritize ICMP, so I was seeing packet loss that didn't reflect actual TCP performance. HTTP requests are more accurate for this use case.
I tried monitoring more endpoints (10+) thinking more data would be better. It just made the workflow slow and didn't improve accuracy. Four well-chosen targets tell me everything I need to know.
I initially set the threshold at 300ms. Too high. By the time latency hit 300ms, users were already experiencing timeouts. 200ms is the sweet spot where I catch problems before they become severe.
The workflow doesn't automatically fail back to the primary ISP. I tried implementing this, but cable repairs are unpredictable. Sometimes latency drops for an hour then spikes again. I manually switch back after confirming stability for at least 6 hours. This is a limitation I accept.
Real-World Performance
In the last three months, this system has triggered 7 times. Five were legitimate cable issues that lasted 4-12 hours. Two were false positives during ISP maintenance windows I didn't know about.
The false positives bothered me until I realized they revealed actual connectivity problems, just planned ones. The system is working as designed: it detects degradation and routes around it.
Average failover time from detection to DNS propagation: 8 minutes. That includes the 2-minute check interval, decision logic, API calls, and partial DNS propagation.
I've never had the workflow fail to execute. n8n on my Proxmox VM has been stable. The VM itself has more uptime than my internet connections.
Monitoring the Monitor
I run a separate n8n workflow that checks if the main monitoring workflow is executing. If it hasn't run in 10 minutes, I get a Telegram alert. This has caught two issues:
- The VM running out of disk space (my fault, too many logs)
- n8n crashing after a bad workflow edit I made
Both times, I fixed it within 20 minutes because I knew immediately.
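The post doesn't say how the watchdog checks execution, so the sketch below assumes a heartbeat approach: the main workflow writes a timestamp on every run, and the watchdog workflow alerts if it goes stale for 10 minutes. The file path is illustrative.

// Watchdog sketch (my assumption of the mechanism): the main workflow writes
// a heartbeat timestamp each run; this Code node flags it if it goes stale.
// Requires Node built-ins to be allowed for Code nodes; path is illustrative.
const fs = require('fs');

const HEARTBEAT_PATH = '/data/monitor-heartbeat.json';
const STALE_MS = 10 * 60 * 1000; // alert if no run in 10 minutes

const { lastRun } = JSON.parse(fs.readFileSync(HEARTBEAT_PATH, 'utf8'));
const stale = Date.now() - lastRun > STALE_MS;

// Downstream: an IF node routes stale === true to the Telegram alert.
return { json: { stale, lastRun } };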
Limitations and Trade-offs
This system only works because I control my DNS and have a backup internet path. If you're using your ISP's DNS or don't have a secondary connection, none of this helps.
It doesn't detect every type of failure. If my home router dies, the workflow can't reach the internet to change DNS. I need to manually switch from another device. Same if the Proxmox host itself fails.
The 5-minute DNS TTL means some users will hit dead endpoints for up to 5 minutes after failover. I can't fix this without going to 60-second TTLs, which would increase DNS query load significantly.
I'm monitoring from a single location (my home network). If the problem is localized to my ISP but not the cable itself, I might fail over unnecessarily. I don't have monitoring from multiple geographic locations because I don't want to pay for that infrastructure.
Cost and Resources
The n8n workflow uses negligible resources. CPU usage is under 1% on the VM. Memory footprint is around 50MB. The workflow executes 720 times per day and generates about 2MB of logs.
Cloudflare API calls are free up to their generous limits. I'm nowhere close to hitting them.
The only real cost is the secondary ISP connection, which I was already paying for as backup internet.
Key Takeaways
You don't need official cable status APIs to build effective failover. Measuring the actual impact (latency to real endpoints) is more useful than knowing which cable is broken.
Rolling averages and flapping prevention are not optional. Without them, you'll create more problems than you solve.
Simple state management (a JSON file) is enough for this use case. Don't over-engineer.
Manual failback is fine. Automation isn't always the answer. Some decisions benefit from human judgment.
Monitor your monitoring. The worst failure is thinking your automation is working when it's not.
This system has saved me from hours of degraded service. It's not perfect, but it's reliable enough that I trust it to run unsupervised. That's the real test.