Implementing Ansible-Based Infrastructure Recovery Using Enroll to Auto-Document Your Proxmox Homelab Configuration

Why I Built This

I run a Proxmox homelab with around 15 services—monitoring, automation, DNS, reverse proxy, media tools, and various self-hosted apps. For a long time, I managed this manually: SSH into a container, install Docker, write a compose file, configure Traefik routing, set up monitoring agents, and hope I remembered everything.

This worked until it didn't. I'd forget to add a service to monitoring. I'd misconfigure a firewall rule. When I needed to rebuild a container, I'd spend an hour trying to remember what I did the first time. And when hardware failed, recovery meant digging through notes and hoping I documented everything.

I needed a way to deploy and redeploy services that didn't rely on my memory or scattered documentation. That's why I built an Ansible-based system that treats my entire infrastructure as code.

What I Actually Built

The core idea is simple: I define every service in an Ansible inventory file. When I run the playbook, it provisions LXC containers on Proxmox, installs Docker, deploys the service, configures Traefik routing, and integrates monitoring—all automatically.

Here's what a service definition looks like in my inventory:

homepage_servers:
  hosts:
    prod-homepage-lxc-01:
      ansible_host: 192.168.10.32
      lxc_id: 130
      lxc_hostname: 'prod-homepage-lxc-01'
      lxc_ip_address: '192.168.10.32'
      lxc_memory: 512
      lxc_cores: 1
      service_middlewares:
        - "default-headers@file"
        - "homepage-headers@file"

That's it. No manual steps. The playbook reads this, creates the container, installs everything, and wires it into my existing infrastructure.
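
For a single service, a deployment run looks something like this (the playbook and inventory file names here are illustrative, not necessarily what the repository uses):

# Illustrative invocation – playbook and inventory names are placeholders
ansible-playbook -i inventory/production.yml site.yml --limit homepage_servers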

The Structure I Use

My Ansible project is organized into roles. Each role handles one specific thing:

  • proxmox_lxc – Creates and configures LXC containers
  • docker_host – Installs Docker and sets up networking
  • traefik – Manages reverse proxy configuration
  • monitoring – Deploys node_exporter or cAdvisor depending on the host type
  • services – Individual service deployments (Homepage, Wiki.js, etc.)

I don't have one giant playbook. I have small, reusable roles that I compose together. This means I can deploy a new service by reusing existing roles with minimal custom code.
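
As a rough sketch of what that composition looks like (the play below is illustrative; the homepage role stands in for whichever service role is being deployed, and in practice the container-creation step is delegated to the Proxmox API rather than run inside the container):

# deploy_homepage.yml – illustrative composition of the roles listed above
- name: Deploy Homepage
  hosts: homepage_servers
  roles:
    - proxmox_lxc    # create or verify the LXC container
    - docker_host    # install Docker and set up networking
    - homepage       # the individual service deployment
    - traefik        # generate the reverse proxy configuration
    - monitoring     # install node_exporter or cAdvisor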

How Traefik Integration Works

One of the most useful parts of this setup is automatic Traefik integration. When I deploy a service, I specify its routing requirements in the inventory:

traefik_services:
  - name: "homepage"
    domain: "homepage.lan.petermac.com"
    backend_url: "http://192.168.40.3:8080"
    cert_resolver: "cloudflare"
    middlewares:
      secure: ["default-headers"]
      insecure: ["redirect-to-https"]

Ansible generates the Traefik dynamic configuration file automatically. The service becomes accessible over HTTPS with a valid Let's Encrypt certificate (issued via Cloudflare DNS validation), without any manual proxy configuration.
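
A simplified sketch of the kind of Jinja2 template that produces this (the real one also wires in the middlewares lists and TLS options):

# traefik-dynamic.yml.j2 – simplified sketch of the generated dynamic config
http:
  routers:
{% for svc in traefik_services %}
    {{ svc.name }}:
      rule: "Host(`{{ svc.domain }}`)"
      service: {{ svc.name }}
      tls:
        certResolver: {{ svc.cert_resolver }}
{% endfor %}
  services:
{% for svc in traefik_services %}
    {{ svc.name }}:
      loadBalancer:
        servers:
          - url: "{{ svc.backend_url }}"
{% endfor %}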

This approach eliminates the need to manually edit Traefik configs every time I add a service. It also means I can redeploy a service and have routing automatically restored.

What Worked

Immutable Infrastructure

Every container starts from a clean template. I don't patch or update containers in place—I redeploy them. This eliminates configuration drift and makes rollbacks trivial. If something breaks, I delete the container and redeploy from the inventory.

This approach requires good backups of persistent data, but that's a problem I had to solve anyway.
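
For reference, the container-creation step in the proxmox_lxc role boils down to something like the task below. This is a sketch using the community.general.proxmox module; the API credentials, OS template, and gateway are placeholders:

# roles/proxmox_lxc/tasks/main.yml – simplified sketch; credentials,
# template, and gateway are placeholders for illustration
- name: Create LXC container from a clean template
  community.general.proxmox:
    api_host: "{{ proxmox_api_host }}"
    api_user: "{{ proxmox_api_user }}"
    api_token_id: "{{ proxmox_api_token_id }}"
    api_token_secret: "{{ proxmox_api_token_secret }}"
    node: "{{ proxmox_node }}"
    vmid: "{{ lxc_id }}"
    hostname: "{{ lxc_hostname }}"
    ostemplate: "local:vztmpl/debian-12-standard_12.2-1_amd64.tar.zst"
    memory: "{{ lxc_memory }}"
    cores: "{{ lxc_cores }}"
    netif: '{"net0":"name=eth0,bridge=vmbr0,ip={{ lxc_ip_address }}/24,gw=192.168.10.1"}'
    state: present
  delegate_to: localhost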

Monitoring by Default

I group hosts in the inventory by whether they need monitoring:

monitoring_targets:
  children:
    homepage_servers:
    wikijs_servers:
    traefik_servers:

When I deploy a service, Ansible automatically installs the appropriate monitoring agent (node_exporter for bare containers, cAdvisor for Docker hosts). Prometheus discovers these endpoints automatically, and I don't have to remember to add them manually.
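
The discovery side can be as simple as a templated file_sd target list that Prometheus watches. A minimal sketch, assuming the default node_exporter port:

# prometheus-targets.yml.j2 – sketch of a generated file_sd target list
{% for host in groups['monitoring_targets'] %}
- targets:
    - "{{ hostvars[host].ansible_host }}:9100"
  labels:
    instance: "{{ host }}"
{% endfor %}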

This means every service has metrics from day one. I've caught resource issues and misconfigurations early because monitoring was already in place.

Disaster Recovery

I've tested this setup twice during hardware failures. Recovery involved:

  1. Restoring data from backups
  2. Running the Ansible playbook against a new Proxmox node
  3. Waiting 20 minutes while everything rebuilt

Both times, services came back online with minimal manual intervention. The inventory file served as the source of truth, and Ansible handled the rest.
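
Step 2 is just a normal playbook run with the replacement node passed in; illustratively (the proxmox_node variable matches the container-creation sketch shown earlier):

# Illustrative recovery run – the variable name depends on how the
# proxmox_lxc role is parameterised
ansible-playbook -i inventory/production.yml site.yml -e proxmox_node=pve-02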

What Didn't Work

Initial Complexity

Building this system took weeks. I had to learn Ansible's role structure, figure out how to interact with the Proxmox API, and debug countless issues with container networking and Docker setups.

For the first month, it felt like overkill; deploying services manually would have been faster. But as I added more services, the automation started paying off. Now, adding a new service takes minutes instead of hours.

Static Resource Profiles

I initially tried to define resource profiles (micro, small, medium, large) for services, thinking I could standardize container sizing. This didn't work because every service has different needs, and the profiles ended up being ignored.

Now I just specify memory and CPU cores directly in the inventory. It's less elegant but more practical.

Testing Integration

I didn't build automated testing into the initial version. Services would deploy, but I had no way to verify they were actually working without manually checking.

I've since added basic health checks to the playbooks—things like waiting for a service to respond on its expected port before marking deployment complete. This catches obvious failures, but it's not comprehensive.
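
The checks themselves are plain Ansible modules; a minimal sketch, with the port and URL as illustrative values:

# Basic deployment health checks – port and URL are illustrative
- name: Wait for the service port to open
  ansible.builtin.wait_for:
    host: "{{ ansible_host }}"
    port: 8080
    timeout: 120

- name: Verify the service answers over HTTP
  ansible.builtin.uri:
    url: "http://{{ ansible_host }}:8080/"
    status_code: 200
  register: health
  retries: 5
  delay: 10
  until: health.status == 200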

Using Enroll for Documentation

One problem with infrastructure-as-code is that the code documents the desired state, not the actual state. I wanted a way to verify that what Ansible deployed matches what's actually running.

I built a tool called Enroll that scrapes my infrastructure and generates documentation automatically. It queries Proxmox for container details, checks Docker for running services, and pulls Traefik routing configurations.

The output is a structured report showing:

  • Every LXC container, its resources, and IP address
  • Every Docker service and its exposed ports
  • Every Traefik route and backend
  • Monitoring agent status

I run this after deployments to verify everything matches expectations. It's also useful for onboarding—someone new to my setup can read the Enroll output and understand what's running without digging through Ansible code.

Enroll doesn't replace the Ansible inventory, but it complements it. The inventory describes what should exist. Enroll shows what actually exists.

Key Takeaways

Automation pays off over time, not immediately. The upfront cost was high, but now I can deploy services faster and more reliably than I could manually.

Immutable infrastructure simplifies recovery. Rebuilding from scratch is easier than debugging a broken container when you know the deployment process is repeatable.

Integration should be automatic. If monitoring, routing, or backups require manual steps, they'll get skipped. Making them part of the deployment process ensures they happen.

Documentation tools should reflect reality. Code-based documentation is great, but tools like Enroll that verify actual state are essential for catching drift.

Start simple, evolve as needed. I didn't build this system all at once. I started with basic container provisioning and added features as I encountered problems.

What's Next

I'm still refining this setup. A few things I'm working on:

  • Better health checks – More comprehensive validation that services are actually functional, not just running
  • Automated capacity planning – Using Prometheus data to identify when containers need more resources
  • Multi-node support – Testing deployment across multiple Proxmox nodes for redundancy

The code is on GitHub at github.com/Peter-Mac/ansible-infrastructure-public. It's not polished, but it's functional and reflects what I actually use.

This approach won't fit everyone's needs, but if you're managing more than a handful of services and tired of manual configuration, infrastructure-as-code is worth the investment.