Tech Expert & Vibe Coder

With 14+ years of experience, I specialize in self-hosting, AI automation, and Vibe Coding – building applications using AI-powered tools like Google Antigravity, Dyad, and Cline. From homelabs to enterprise solutions.

Configuring Docker Swarm Secrets Rotation with Vault Integration for Zero-Downtime Database Credential Updates

Why I Worked on This

I run several Docker Swarm clusters where database credentials need regular rotation. My initial setup used static secrets baked into Docker configs, which meant downtime every time I rotated a password. When a database credential leaked in logs once, I had to manually update every service that used it. That process took hours and involved service restarts I couldn't fully control.

I needed a way to rotate secrets without touching service definitions or causing outages. HashiCorp Vault was already running in my infrastructure for other projects, so I looked for a way to bridge Vault's dynamic secrets into Docker Swarm's native secret system.

My Real Setup

I'm running a three-node Docker Swarm cluster on Proxmox VMs. Vault sits outside the swarm as a standalone service with its own TLS setup. My database credentials live in Vault's KV v2 engine under paths like database/mysql and database/postgres.
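For reference, recreating that layout takes two commands. A sketch, assuming the KV v2 engine is mounted at `database/` (the mount path matches the paths above; field names and values here are placeholders, not my real credentials):

```shell
# Mount a KV v2 engine at the path "database" (note: this is a plain KV
# mount, not Vault's dedicated database secrets engine).
vault secrets enable -path=database kv-v2

# Store credentials as flat top-level fields under database/mysql.
vault kv put database/mysql username=app password='change-me'
```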

The swarm-external-secrets plugin runs as a Docker plugin on each manager node. It acts as a secrets driver, meaning Docker Swarm asks the plugin for secret values instead of storing them directly. The plugin fetches those values from Vault at runtime.

My docker-compose.yml defines secrets like this:

secrets:
  mysql_password:
    driver: vault-secrets-plugin:latest
    labels:
      vault_path: "database/mysql"
      vault_field: "password"

Services reference these secrets normally. Docker Swarm mounts them as files in /run/secrets/, but the plugin handles the actual retrieval from Vault.
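For completeness, a service consumes such a secret exactly as it would a file-backed one. A minimal sketch (service name, image, and environment variable are illustrative assumptions; `MYSQL_ROOT_PASSWORD_FILE` is the official MySQL image's convention for reading a password from a file):

```yaml
services:
  mysql:
    image: mysql:8
    secrets:
      - mysql_password
    environment:
      # The official MySQL image reads the password from a file path,
      # which is exactly how Swarm exposes the secret at runtime.
      MYSQL_ROOT_PASSWORD_FILE: /run/secrets/mysql_password
```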

What Worked

The plugin's rotation feature monitors Vault for changes by comparing SHA256 hashes of secret values. I set VAULT_ROTATION_INTERVAL to 2 minutes during testing, though I use 5 minutes in production.

When I update a secret in Vault using vault kv put database/mysql password=new_password (the same path the vault_path label points at), the plugin detects the change within the next interval. It creates a new version of the Docker secret and forces a service update. Docker Swarm's rolling update mechanism handles the rest—containers restart gracefully with the new secret value.
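End to end, a rotation looks like this from the Vault side. A sketch, assuming the KV v2 mount is named `database/` as in the labels above (the generated password is just an example):

```shell
# Write a new password version; KV v2 retains the old version for rollback.
vault kv put database/mysql password="$(openssl rand -base64 24)"

# Confirm the new current value; the plugin picks it up on its next
# rotation interval and triggers a rolling service update.
vault kv get -field=password database/mysql
```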

This eliminated manual intervention. I can rotate credentials in Vault, and services pick up the change automatically. No downtime, no manual restarts, no forgotten services still using old credentials.

The plugin also cleans up old secret versions, which prevents Docker from accumulating stale data. This was important because I initially hit Docker's secret version limit during rapid testing.

Authentication Setup

I use Vault's AppRole authentication for the plugin. I created a policy that grants read access only to the specific KV paths my services need. The plugin authenticates once at startup and renews its token automatically.

Setting this up required creating the AppRole in Vault, binding the policy, and then configuring the plugin with the role ID and secret ID as environment variables. This felt more secure than using a root token, which I've seen in some examples online.
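The broad strokes of that setup look roughly like this. A sketch where the policy and role names are my placeholders, and the KV v2 mount is assumed to be `database/` (note that KV v2 inserts `data/` into read paths):

```shell
# Read-only policy scoped to exactly the paths the services need.
vault policy write swarm-secrets-ro - <<'EOF'
path "database/data/mysql"    { capabilities = ["read"] }
path "database/data/postgres" { capabilities = ["read"] }
EOF

# Create an AppRole bound to that policy.
vault auth enable approle
vault write auth/approle/role/swarm-secrets \
    token_policies="swarm-secrets-ro" \
    token_ttl=1h token_max_ttl=24h

# Fetch the credentials the plugin receives as environment variables.
vault read -field=role_id auth/approle/role/swarm-secrets/role-id
vault write -f -field=secret_id auth/approle/role/swarm-secrets/secret-id
```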

Monitoring

The plugin logs rotation events to Docker's logging system. I pipe these through my existing Loki setup, which lets me track when secrets change and correlate that with service restarts.

I also check plugin status with journalctl -u docker.service -f | grep vault when debugging. This shows real-time activity, including Vault connection errors or hash mismatches.

What Didn't Work

The first issue I hit was running the plugin on worker nodes. The plugin needs access to Docker's API to manage secrets and trigger service updates, and the Swarm management API is only served by manager nodes, so workers can never provide that access. I had to ensure the plugin only runs on manager nodes, which meant adjusting my deployment strategy.

Another problem was service restart behavior. Not all applications handle secret changes gracefully. My PostgreSQL containers, for example, don't reload credentials without a full restart. This meant I had to configure services with update_config settings that forced container replacement rather than just signaling the process.
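For the PostgreSQL case, the relevant deploy settings look roughly like this. A sketch with illustrative values; `stop-first` matters for databases because it ensures the old container is gone before the new one starts, so two instances never share the same data volume:

```yaml
services:
  postgres:
    deploy:
      update_config:
        parallelism: 1     # replace one task at a time
        order: stop-first  # stop the old container before starting the new one
        delay: 10s         # pause between task replacements
        failure_action: rollback
```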

I also initially set the rotation interval too low—30 seconds. This caused excessive Vault API calls and triggered rate limits during peak usage. Vault started rejecting requests, which broke secret retrieval for new deployments. I increased the interval to 5 minutes, which balanced responsiveness with API load.
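Changing the interval after the fact means reconfiguring the plugin, which Docker only allows while the plugin is disabled. A sketch, assuming the plugin accepts a duration string like `5m` for `VAULT_ROTATION_INTERVAL`:

```shell
docker plugin disable vault-secrets-plugin:latest
docker plugin set vault-secrets-plugin:latest VAULT_ROTATION_INTERVAL=5m
docker plugin enable vault-secrets-plugin:latest
```

While the plugin is disabled, new tasks can't resolve their secrets, so this is worth doing during a quiet window.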

The plugin doesn't support nested JSON fields in Vault secrets. If my secret structure looks like {"db": {"password": "value"}}, I can't extract db.password directly. I had to flatten my Vault secret structure to work around this.
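Concretely, the workaround is to store leaf values as top-level fields instead of nested objects. A sketch with placeholder values:

```shell
# Doesn't work with the plugin: a nested object that the vault_field
# label can't reach (there is no "db.password" path syntax).
vault kv put database/mysql db='{"password": "s3cret", "user": "app"}'

# Works: flatten to top-level fields and reference them directly,
# e.g. vault_field: "db_password".
vault kv put database/mysql db_password='s3cret' db_user='app'
```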

Docker Socket Permissions

The plugin requires access to /var/run/docker.sock to manage secrets. This felt risky at first because it grants broad Docker API access. I mitigated this by running the plugin only on manager nodes and using Docker's built-in access controls. I also monitor plugin logs for any unexpected API calls.

Key Takeaways

This setup works well for services that can handle graceful restarts. If your application requires zero-downtime credential rotation at the process level, you'll need application-level logic to reload secrets without restarting.

The plugin's rotation interval is a trade-off. Shorter intervals mean faster propagation of new secrets but more load on Vault. I settled on 5 minutes, which gives me reasonable response time without hammering Vault's API.

Running this on manager nodes only is non-negotiable. Worker nodes can't manage secrets, so the plugin won't function there. Plan your swarm topology accordingly.

Vault's KV v2 engine works best for this use case because it supports versioning. If you're using KV v1, you lose the version metadata the plugin uses alongside hash comparison to detect changes, and you lose the audit trail of when a secret last changed.
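That version metadata is also handy for auditing rotations by hand; both commands below are standard KV v2 operations (paths assume the `database/` mount used throughout):

```shell
# List every version of the secret with created/deleted timestamps.
vault kv metadata get database/mysql

# Read an older version, e.g. to diff against the current one.
vault kv get -version=2 database/mysql
```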

Finally, test your rotation setup in a non-production environment first. I caught several issues—like the restart behavior and rate limiting—that would have caused outages if I'd deployed directly to production.

Current State

I've been running this setup for several months now. Database credentials rotate weekly via a scheduled job that writes new values into Vault. The plugin handles propagation automatically, and I haven't had a single downtime incident related to secret rotation since deploying it.

The monitoring logs show consistent rotation activity, and I've tuned the interval to balance responsiveness with resource usage. This approach removed a significant operational burden from my workflow.