
Building a Self-Healing n8n Workflow for Failed Docker Container Restarts

Automation · 7 min read · Published Mar 21, 2026

Why I Built This

I run multiple Docker containers on my Proxmox setup—n8n itself, various automation workers, monitoring tools, and database services. Most of the time, Docker’s built-in restart policies handle crashes fine. The container dies, Docker notices, and it comes back up.

But I kept running into a specific problem: containers that became unhealthy but didn’t actually exit. The process was still running, Docker marked it as unhealthy based on failed health checks, but nothing restarted it. I’d find these zombie containers days later, silently broken while everything around them kept running.

I needed something that could detect unhealthy containers and restart them automatically. I already had n8n running for other automation tasks, so I built a workflow to handle this monitoring and recovery.

My Setup Context

This workflow runs on my primary Proxmox node where I host most of my Docker containers. The relevant pieces:

  • n8n running in Docker Compose with PostgreSQL backend
  • Docker socket mounted into the n8n container for API access
  • Health checks configured on critical containers (databases, reverse proxy, monitoring tools)
  • All containers use the “unless-stopped” restart policy
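
For reference, the health checks themselves are ordinary Compose healthcheck blocks. Here's a minimal sketch for a Postgres-style service — the service name, command, and timings are illustrative, not my exact config:

```yaml
services:
  postgres:
    image: postgres:16
    restart: unless-stopped
    healthcheck:
      # pg_isready exits non-zero when the server isn't accepting connections
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 30s
      timeout: 5s
      retries: 3   # three consecutive failures flip the status to "unhealthy"
```

Without a healthcheck block, Docker never marks a container unhealthy, so the workflow below would never see it.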

The key configuration change I made was mounting the Docker socket into n8n’s container. In my docker-compose.yml, I added this volume mount:

volumes:
  - /var/run/docker.sock:/var/run/docker.sock:ro

This gives n8n read access to the Docker API through the Unix socket. I kept it read-only initially, but quickly realized I needed write access to actually restart containers, so I removed the “:ro” flag. This is a security consideration—n8n can now control Docker—but since this is my internal infrastructure and n8n is already behind authentication, I accepted that trade-off.

The Workflow I Built

The n8n workflow runs every 2 minutes on a schedule trigger. Here’s what it does:

Step 1: Query Docker for Unhealthy Containers

I use an Execute Command node to run:

docker ps --filter "health=unhealthy" --format "{{.ID}}|{{.Names}}|{{.Status}}"

This returns a pipe-delimited list of any containers currently marked unhealthy. The format flag gives me just the data I need: container ID, name, and current status.

Step 2: Parse and Split the Output

The command output comes back as a single string with one container per line. I added a Code node to split this into individual items:

// Raw stdout from the "docker ps" Execute Command node
const output = $input.first().json.stdout;

// No unhealthy containers: return no items so the workflow ends here
if (!output || output.trim() === '') {
  return [];
}

// One line per container: "<id>|<name>|<status>"
const containers = output.trim().split('\n').map(line => {
  const [id, name, status] = line.split('|');
  return {
    json: {
      containerId: id,
      containerName: name,
      status: status
    }
  };
});

return containers;

This creates one workflow item per unhealthy container, which lets me process them individually.
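
The parsing logic can be sanity-checked outside n8n by wrapping it in a plain function. This is a standalone sketch, not the Code node itself, and the sample stdout below is hypothetical:

```javascript
// Standalone version of the Code node's parsing logic.
function parseUnhealthy(stdout) {
  if (!stdout || stdout.trim() === '') return [];
  return stdout.trim().split('\n').map(line => {
    const [id, name, status] = line.split('|');
    return { json: { containerId: id, containerName: name, status: status } };
  });
}

// Hypothetical output from the docker ps command in Step 1
const sample = 'a1b2c3|postgres|Up 2 days (unhealthy)\nd4e5f6|uptime-kuma|Up 5 hours (unhealthy)';
console.log(parseUnhealthy(sample).map(i => i.json.containerName)); // [ 'postgres', 'uptime-kuma' ]
```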

Step 3: Filter Out n8n Itself

Early on, I created a circular problem: if n8n became unhealthy, the workflow would try to restart it, killing the workflow mid-execution. I added a Filter node that excludes any container with “n8n” in the name:

{{ $json.containerName.includes('n8n') }} is false

If n8n fails, I handle that separately with an external monitoring script on the host.
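
If you'd rather not add a separate Filter node, the same exclusion can live in the parsing Code node. A sketch as a pure function, operating on the item shape produced in Step 2:

```javascript
// Drop any item whose container name contains "n8n",
// so the workflow never restarts the container it runs in.
function excludeSelf(items) {
  return items.filter(item => !item.json.containerName.includes('n8n'));
}

const items = [
  { json: { containerId: 'a1', containerName: 'n8n', status: 'unhealthy' } },
  { json: { containerId: 'b2', containerName: 'postgres', status: 'unhealthy' } },
];
console.log(excludeSelf(items).map(i => i.json.containerName)); // [ 'postgres' ]
```

Note this also excludes anything else with "n8n" in the name (workers, sidecars), which may or may not be what you want.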

Step 4: Restart the Container

For each remaining unhealthy container, I use another Execute Command node:

docker restart {{ $json.containerId }}

This issues the restart command using the container ID from the parsed data.

Step 5: Log the Action

I send a notification to a Discord webhook with the container name and timestamp. This gives me a record of what got restarted and when. The HTTP Request node posts:

{
  "content": "Container restarted: {{ $json.containerName }} at {{ $now.toFormat('yyyy-MM-dd HH:mm:ss') }}"
}
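
The same payload can also be assembled in a Code node instead of templating it inside the HTTP Request node. A sketch using plain Date rather than n8n's $now helper:

```javascript
// Build the Discord webhook body for one restarted container.
// Discord webhooks accept a JSON body with a "content" string.
function buildDiscordPayload(containerName, when = new Date()) {
  // ISO timestamp trimmed to "YYYY-MM-DD HH:mm:ss" (UTC)
  const timestamp = when.toISOString().replace('T', ' ').slice(0, 19);
  return { content: `Container restarted: ${containerName} at ${timestamp}` };
}

console.log(buildDiscordPayload('postgres', new Date('2026-03-21T10:30:00Z')));
// { content: 'Container restarted: postgres at 2026-03-21 10:30:00' }
```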

I keep these logs in a dedicated Discord channel that I can search later if I need to track patterns.

What Worked

The workflow catches unhealthy containers reliably. I’ve had it running for about four months now, and it’s restarted containers roughly 20-25 times in that period—mostly my Postgres container during high load, and occasionally a custom monitoring tool I built that has memory leak issues.

The 2-minute schedule strikes a good balance. It’s frequent enough to catch problems quickly but not so aggressive that it hammers the system. I tested 30-second intervals initially, but that felt excessive and added unnecessary noise to the logs.

Mounting the Docker socket works cleanly. I was concerned about permission issues, but since n8n runs as the node user and I set appropriate group permissions on the socket, it has exactly the access it needs.

The Discord notifications turned out to be more valuable than I expected. I can see restart patterns—like my Postgres container becoming unhealthy during backup operations—and that’s helped me tune other parts of my setup.

What Didn’t Work

My first attempt used the Docker HTTP API directly instead of the socket. I tried to query http://localhost:2375/containers/json?filters={"health":["unhealthy"]} but couldn’t get the filters to work properly. The JSON formatting in n8n’s HTTP Request node kept breaking, and I wasted time debugging URL encoding issues. Switching to the socket and using the CLI was simpler.

I initially tried restarting containers with docker-compose restart, thinking it would be cleaner, but that only works if you're in the directory with the compose file and know the service name. Since I run containers across different compose files plus standalone containers, using docker restart with the container ID was more reliable.

The workflow doesn’t handle cascading failures well. If multiple dependent containers fail at once—like a database and the service that depends on it—the workflow restarts them all simultaneously. Sometimes the dependent service comes up before the database is ready and just fails again. I haven’t fixed this yet because it’s rare enough that I just let the workflow handle it on the next cycle.
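
One way to soften this would be to restart containers in dependency order before issuing the restart commands. A sketch of the ordering half — the priority map is entirely hypothetical and would need to match your own container names:

```javascript
// Lower number = restart earlier. Anything unlisted restarts last.
// These names are examples, not my actual containers.
const RESTART_PRIORITY = { postgres: 0, redis: 0, 'reverse-proxy': 1 };

function orderByDependency(items) {
  const rank = name => RESTART_PRIORITY[name] ?? 99;
  // Sort a copy so the original item list is untouched
  return [...items].sort((a, b) => rank(a.json.containerName) - rank(b.json.containerName));
}

const items = [
  { json: { containerName: 'app-worker' } },
  { json: { containerName: 'postgres' } },
];
console.log(orderByDependency(items).map(i => i.json.containerName)); // [ 'postgres', 'app-worker' ]
```

You'd still want a Wait node between the restarts so dependents don't come up before the database is ready.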

I tried adding a “grace period” check where the workflow would only restart containers that had been unhealthy for at least 5 minutes, thinking this would prevent premature restarts. But this added complexity—I had to track state between workflow runs—and I realized most of my unhealthy containers don’t recover on their own anyway. I removed that logic.

Security Considerations

Giving n8n access to the Docker socket is effectively giving it root access to the host. Any workflow can now start, stop, or modify containers. I’m comfortable with this on my internal network where n8n is behind authentication and not exposed to the internet, but I wouldn’t do this on a shared system or anywhere untrusted users could create workflows.

I kept the socket mount read-only initially and used a separate privileged container for restart operations, but that added complexity without meaningful security improvement. If someone compromises n8n, they can execute arbitrary commands through workflow nodes anyway.

The workflow logs container names and IDs to Discord. I made sure this channel isn’t public and doesn’t contain any sensitive environment variables or configuration details.

Current Limitations

This workflow doesn’t help if Docker itself crashes or the entire host system goes down. For that, I rely on Proxmox’s built-in VM recovery and manual intervention.

It also can’t fix containers that become unhealthy due to external dependencies. If a container is unhealthy because it can’t reach an external API, restarting it doesn’t help. The workflow will keep restarting it every 2 minutes until I intervene manually.

The workflow has no concept of “restart fatigue.” If a container is fundamentally broken and keeps failing health checks, this will restart it indefinitely. I’ve thought about adding a counter or rate limit, but haven’t needed it yet.
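
If I ever add that rate limit, the counting itself is simple; in n8n the state could persist between runs in workflow static data. A sketch of the pure logic — the 3-restarts-per-hour threshold is an arbitrary example:

```javascript
// Decide whether a container may be restarted, given a history of
// restart timestamps (ms since epoch) per container name.
function shouldRestart(history, name, now = Date.now(), maxPerWindow = 3, windowMs = 60 * 60 * 1000) {
  const recent = (history[name] || []).filter(t => now - t < windowMs);
  history[name] = recent; // prune entries older than the window
  if (recent.length >= maxPerWindow) return false; // restart fatigue: back off
  recent.push(now); // record this restart
  return true;
}

const history = {};
const t0 = Date.now();
console.log(shouldRestart(history, 'postgres', t0));        // true (1st restart)
console.log(shouldRestart(history, 'postgres', t0 + 1000)); // true (2nd)
console.log(shouldRestart(history, 'postgres', t0 + 2000)); // true (3rd)
console.log(shouldRestart(history, 'postgres', t0 + 3000)); // false (limit hit)
```

In a Code node the history object would come from something like $getWorkflowStaticData('global') so it survives between the 2-minute runs.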

Key Takeaways

Docker’s health checks are useful, but they don’t trigger automatic restarts by themselves. You need something watching for unhealthy containers and taking action.

Using n8n for this made sense because I already run it for other automation. Building a separate monitoring service would have been overkill for my needs.

The Docker socket approach is simple and works reliably. It’s a security decision worth thinking through, but for internal infrastructure, it’s practical.

Logging restarts to an external system—like Discord or a file—helps spot patterns I wouldn’t otherwise notice. Some containers have issues I can fix; others just need to be restarted occasionally, and knowing which is which helps me prioritize.

Self-healing workflows don’t eliminate the need for proper monitoring and root cause analysis. They buy you time and reduce manual intervention, but broken containers are still broken for a reason.


About the Author

Vipin PG

Vipin PG is a software professional with 15+ years of hands-on experience in system infrastructure, browser performance, and AI-powered development. Holding an MCA from Kerala University, he has worked across enterprises in Dubai and Kochi before running his independent tech consultancy. He has written 180+ tutorials on Docker, networking, and system troubleshooting - and he actually runs the setups he writes about.
