Implementing Ansible-Based Infrastructure Recovery Using Enroll to Auto-Document Your Proxmox Homelab Configuration

Why I Built This

I run a Proxmox homelab with around 15 services—monitoring, automation, DNS, reverse proxy, media tools, and various self-hosted apps. For a long time, I managed this manually: SSH into a container, install Docker, write a compose file, configure Traefik routing, set up monitoring agents, and hope I remembered everything.

This worked until it didn't. I'd forget to add a service to monitoring. I'd misconfigure a firewall rule. When I needed to rebuild a container, I'd spend an hour trying to remember what I did the first time. And when hardware failed, recovery meant digging through notes and hoping I documented everything.

I needed a way to deploy and redeploy services that didn't rely on my memory or scattered documentation. That's why I built an Ansible-based system that treats my entire infrastructure as code.

What I Actually Built

The core idea is simple: I define every service in an Ansible inventory file. When I run the playbook, it provisions LXC containers on Proxmox, installs Docker, deploys the service, configures Traefik routing, and integrates monitoring—all automatically.

Here's what a service definition looks like in my inventory:

homepage_servers:
  hosts:
    prod-homepage-lxc-01:
      ansible_host: 192.168.10.32
      lxc_id: 130
      lxc_hostname: 'prod-homepage-lxc-01'
      lxc_ip_address: '192.168.10.32'
      lxc_memory: 512
      lxc_cores: 1
      service_middlewares:
        - "default-headers@file"
        - "homepage-headers@file"

That's it. No manual steps. The playbook reads this, creates the container, installs everything, and wires it into my existing infrastructure.
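
For a single service, a deployment run looks something like this (the playbook and inventory file names here are illustrative, not necessarily what the repository uses):

# Illustrative invocation – playbook and inventory names are placeholders
ansible-playbook -i inventory/production.yml site.yml --limit homepage_servers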

The Structure I Use

My Ansible project is organized into roles. Each role handles one specific thing:

  • proxmox_lxc – Creates and configures LXC containers
  • docker_host – Installs Docker and sets up networking
  • traefik – Manages reverse proxy configuration
  • monitoring – Deploys node_exporter or cAdvisor depending on the host type
  • services – Individual service deployments (Homepage, Wiki.js, etc.)

I don't have one giant playbook. I have small, reusable roles that I compose together. This means I can deploy a new service by reusing existing roles with minimal custom code.
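
As a rough sketch of what that composition looks like (the play below is illustrative; the homepage role stands in for whichever service role is being deployed, and in practice the container-creation step is delegated to the Proxmox API rather than run inside the container):

# deploy_homepage.yml – illustrative composition of the roles listed above
- name: Deploy Homepage
  hosts: homepage_servers
  roles:
    - proxmox_lxc    # create or verify the LXC container
    - docker_host    # install Docker and set up networking
    - homepage       # the individual service deployment
    - traefik        # generate the reverse proxy configuration
    - monitoring     # install node_exporter or cAdvisor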

How Traefik Integration Works

One of the most useful parts of this setup is automatic Traefik integration. When I deploy a service, I specify its routing requirements in the inventory:

traefik_services:
  - name: "homepage"
    domain: "homepage.lan.petermac.com"
    backend_url: "http://192.168.40.3:8080"
    cert_resolver: "cloudflare"
    middlewares:
      secure: ["default-headers"]
      insecure: ["redirect-to-https"]

Ansible generates the Traefik dynamic configuration file automatically. The service becomes accessible over HTTPS with a valid Let's Encrypt certificate (issued via Cloudflare DNS validation), without any manual proxy configuration.
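
A simplified sketch of the kind of Jinja2 template that produces this (the real one also wires in the middlewares lists and TLS options):

# traefik-dynamic.yml.j2 – simplified sketch of the generated dynamic config
http:
  routers:
{% for svc in traefik_services %}
    {{ svc.name }}:
      rule: "Host(`{{ svc.domain }}`)"
      service: {{ svc.name }}
      tls:
        certResolver: {{ svc.cert_resolver }}
{% endfor %}
  services:
{% for svc in traefik_services %}
    {{ svc.name }}:
      loadBalancer:
        servers:
          - url: "{{ svc.backend_url }}"
{% endfor %}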

This approach eliminates the need to manually edit Traefik configs every time I add a service. It also means I can redeploy a service and have routing automatically restored.

What Worked

Immutable Infrastructure

Every container starts from a clean template. I don't patch or update containers in place—I redeploy them. This eliminates configuration drift and makes rollbacks trivial. If something breaks, I delete the container and redeploy from the inventory.

This approach requires good backups of persistent data, but that's a problem I had to solve anyway.
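
For reference, the container-creation step in the proxmox_lxc role boils down to something like the task below. This is a sketch using the community.general.proxmox module; the API credentials, OS template, and gateway are placeholders:

# roles/proxmox_lxc/tasks/main.yml – simplified sketch; credentials,
# template, and gateway are placeholders for illustration
- name: Create LXC container from a clean template
  community.general.proxmox:
    api_host: "{{ proxmox_api_host }}"
    api_user: "{{ proxmox_api_user }}"
    api_token_id: "{{ proxmox_api_token_id }}"
    api_token_secret: "{{ proxmox_api_token_secret }}"
    node: "{{ proxmox_node }}"
    vmid: "{{ lxc_id }}"
    hostname: "{{ lxc_hostname }}"
    ostemplate: "local:vztmpl/debian-12-standard_12.2-1_amd64.tar.zst"
    memory: "{{ lxc_memory }}"
    cores: "{{ lxc_cores }}"
    netif: '{"net0":"name=eth0,bridge=vmbr0,ip={{ lxc_ip_address }}/24,gw=192.168.10.1"}'
    state: present
  delegate_to: localhost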

Monitoring by Default

I group hosts in the inventory by whether they need monitoring:

monitoring_targets:
  children:
    homepage_servers:
    wikijs_servers:
    traefik_servers:

When I deploy a service, Ansible automatically installs the appropriate monitoring agent (node_exporter for bare containers, cAdvisor for Docker hosts). Prometheus discovers these endpoints automatically, and I don't have to remember to add them manually.
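
The discovery side can be as simple as a templated file_sd target list that Prometheus watches. A minimal sketch, assuming the default node_exporter port:

# prometheus-targets.yml.j2 – sketch of a generated file_sd target list
{% for host in groups['monitoring_targets'] %}
- targets:
    - "{{ hostvars[host].ansible_host }}:9100"
  labels:
    instance: "{{ host }}"
{% endfor %}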

This means every service has metrics from day one. I've caught resource issues and misconfigurations early because monitoring was already in place.

Disaster Recovery

I've tested this setup twice during hardware failures. Recovery involved:

  1. Restoring data from backups
  2. Running the Ansible playbook against a new Proxmox node
  3. Waiting 20 minutes while everything rebuilt

Both times, services came back online with minimal manual intervention. The inventory file served as the source of truth, and Ansible handled the rest.
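
Step 2 is just a normal playbook run with the replacement node passed in; illustratively (the proxmox_node variable matches the container-creation sketch shown earlier):

# Illustrative recovery run – the variable name depends on how the
# proxmox_lxc role is parameterised
ansible-playbook -i inventory/production.yml site.yml -e proxmox_node=pve-02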

What Didn't Work

Initial Complexity

Building this system took weeks. I had to learn Ansible's role structure, figure out how to interact with the Proxmox API, and debug countless issues with container networking and Docker setups.

For the first month, it felt like overkill; deploying services manually would have been faster. But as I added more services, the automation started paying off. Now, adding a new service takes minutes instead of hours.

Static Resource Profiles

I initially tried to define resource profiles (micro, small, medium, large) for services, thinking I could standardize container sizing. This didn't work because every service has different needs, and the profiles ended up being ignored.

Now I just specify memory and CPU cores directly in the inventory. It's less elegant but more practical.

Testing Integration

I didn't build automated testing into the initial version. Services would deploy, but I had no way to verify they were actually working without manually checking.

I've since added basic health checks to the playbooks—things like waiting for a service to respond on its expected port before marking deployment complete. This catches obvious failures, but it's not comprehensive.
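
The checks themselves are plain Ansible modules; a minimal sketch, with the port and URL as illustrative values:

# Basic deployment health checks – port and URL are illustrative
- name: Wait for the service port to open
  ansible.builtin.wait_for:
    host: "{{ ansible_host }}"
    port: 8080
    timeout: 120

- name: Verify the service answers over HTTP
  ansible.builtin.uri:
    url: "http://{{ ansible_host }}:8080/"
    status_code: 200
  register: health
  retries: 5
  delay: 10
  until: health.status == 200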

Using Enroll for Documentation

One problem with infrastructure-as-code is that the code documents the desired state, not the actual state. I wanted a way to verify that what Ansible deployed matches what's actually running.

I built a tool called Enroll that scrapes my infrastructure and generates documentation automatically. It queries Proxmox for container details, checks Docker for running services, and pulls Traefik routing configurations.

The output is a structured report showing:

  • Every LXC container, its resources, and IP address
  • Every Docker service and its exposed ports
  • Every Traefik route and backend
  • Monitoring agent status

I run this after deployments to verify everything matches expectations. It's also useful for onboarding—someone new to my setup can read the Enroll output and understand what's running without digging through Ansible code.

Enroll doesn't replace the Ansible inventory, but it complements it. The inventory describes what should exist. Enroll shows what actually exists.

Key Takeaways

Automation pays off over time, not immediately. The upfront cost was high, but now I can deploy services faster and more reliably than I could manually.

Immutable infrastructure simplifies recovery. Rebuilding from scratch is easier than debugging a broken container when you know the deployment process is repeatable.

Integration should be automatic. If monitoring, routing, or backups require manual steps, they'll get skipped. Making them part of the deployment process ensures they happen.

Documentation tools should reflect reality. Code-based documentation is great, but tools like Enroll that verify actual state are essential for catching drift.

Start simple, evolve as needed. I didn't build this system all at once. I started with basic container provisioning and added features as I encountered problems.

What's Next

I'm still refining this setup. A few things I'm working on:

  • Better health checks – More comprehensive validation that services are actually functional, not just running
  • Automated capacity planning – Using Prometheus data to identify when containers need more resources
  • Multi-node support – Testing deployment across multiple Proxmox nodes for redundancy

The code is on GitHub at github.com/Peter-Mac/ansible-infrastructure-public. It's not polished, but it's functional and reflects what I actually use.

This approach won't fit everyone's needs, but if you're managing more than a handful of services and tired of manual configuration, infrastructure-as-code is worth the investment.