Why I Built This Dashboard
I run multiple Docker Compose stacks on my Proxmox home lab. Some handle automation workflows, others manage data pipelines, and a few just sit there doing scheduled tasks. The problem wasn't that containers crashed—it was that I didn't know they had crashed until something downstream broke.
I needed visibility into container health without constantly SSH-ing into hosts or tailing logs. I wanted to know:
- Which containers were running, stopped, or restarting
- Resource usage patterns over time
- Exit codes when things failed
- Historical trends to catch slow degradation
I already had Grafana running for other monitoring tasks, so I decided to build a dashboard specifically for Docker Compose stack health.
My Actual Setup
I run this monitoring stack on a Proxmox LXC container (Debian-based) that hosts several Docker Compose projects. The monitoring components themselves are also deployed via Docker Compose.
Here's what I used:
- cAdvisor – Exposes container-level metrics (CPU, memory, network, filesystem)
- Prometheus – Scrapes and stores metrics from cAdvisor
- Grafana – Visualizes the data
- Docker socket mount – Gives cAdvisor access to container stats
I did not use Node Exporter initially because I only cared about containers, not host-level CPU or disk stats. I added it later when I wanted to correlate container behavior with overall system load.
The Docker Compose File
This is the actual compose file I use (trimmed of unrelated services):
version: '3.8'

services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.47.0
    container_name: cadvisor
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    privileged: true
    devices:
      - /dev/kmsg
    restart: unless-stopped

  prometheus:
    image: prom/prometheus:v2.45.0
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
    restart: unless-stopped

  grafana:
    image: grafana/grafana:10.0.3
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    restart: unless-stopped

volumes:
  prometheus-data:
  grafana-data:
I pinned specific versions because I got burned once by a cAdvisor update that changed metric label names, breaking my dashboard queries.
Prometheus Configuration
The prometheus.yml file tells Prometheus where to scrape metrics:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
I kept the scrape interval at 15 seconds. I tried 5 seconds once, thinking more data would be better, but it just bloated my Prometheus storage without adding useful insights.
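Earlier I mentioned adding Node Exporter later to correlate container behavior with host load. That change was small: one extra service in the compose file and one extra scrape job, roughly along these lines (a sketch; the image tag and options are illustrative, and most setups also mount /proc and /sys read-only):

  node-exporter:
    image: prom/node-exporter:v1.6.1
    container_name: node-exporter
    ports:
      - "9100:9100"
    restart: unless-stopped

And in prometheus.yml:

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']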
What Worked
Container State Tracking
cAdvisor exposes a metric called container_last_seen, which records the last time a container was observed. I used this to detect stopped containers:
time() - container_last_seen{name=~".+"} > 60
This query finds containers that haven't been seen in the last 60 seconds. It's not perfect—if Prometheus scraping fails, it triggers false positives—but it works well enough for my setup.
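One way to reduce those false positives, sketched here rather than something I run verbatim, is to only evaluate the expression while the cAdvisor target itself is reporting healthy:

time() - container_last_seen{name=~".+"} > 60 and on() (up{job="cadvisor"} == 1)

When the cAdvisor scrape is failing, up drops to 0 and the condition simply returns nothing instead of flagging containers.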
Resource Usage Over Time
I built a panel showing memory usage as a percentage of the container's limit:
(container_memory_usage_bytes{name=~".+"} / container_spec_memory_limit_bytes{name=~".+"}) * 100
This immediately highlighted a container that was slowly leaking memory. It would restart every few days, and I never noticed the pattern until I saw the graph.
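To catch that kind of slow leak before it turns into restarts, a query along these lines (a rough sketch, not a panel I actually run) flags containers that would hit their memory limit within a day if the current trend held:

predict_linear(container_memory_usage_bytes{name=~".+"}[6h], 24 * 3600) > container_spec_memory_limit_bytes{name=~".+"}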
Exit Code Visibility
This was harder than expected. cAdvisor doesn't expose exit codes, so I had to go through the Docker Engine API instead, mounting /var/run/docker.sock and parsing each container's inspect data with a script.
I ended up writing a small Python script that runs every 5 minutes via cron:
import docker

# Connect to the local Docker daemon through the socket
client = docker.from_env()

# Include stopped containers, not just running ones
for container in client.containers.list(all=True):
    state = container.attrs['State']
    if state['Status'] == 'exited':
        exit_code = state['ExitCode']
        print(f"{container.name}: exited with code {exit_code}")
I send the output to a log file that I tail in Grafana using the Loki plugin. It's clunky, but it works. I haven't found a cleaner way to get exit codes into Prometheus without writing a custom exporter.
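The cron entry is nothing special, just the script with its output appended to a log file (the paths here are illustrative, not my exact layout):

*/5 * * * * python3 /opt/scripts/exit_codes.py >> /var/log/container-exits.log 2>&1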
Restart Count Alerts
cAdvisor tracks container_start_time_seconds, which changes to a new value every time a container restarts. I used this to detect containers that restart frequently:
changes(container_start_time_seconds{name=~".+"}[1h]) > 3
This alerts me if a container restarts more than 3 times in an hour. I set up a Grafana alert that sends a message to my self-hosted n8n webhook, which then posts to a Discord channel.
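For reference, the same condition expressed as a Prometheus alerting rule would look roughly like this; it's a sketch only, since I actually configure the alert in Grafana:

groups:
  - name: container-restarts
    rules:
      - alert: ContainerRestartingFrequently
        expr: changes(container_start_time_seconds{name=~".+"}[1h]) > 3
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.name }} restarted more than 3 times in the last hour"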
What Didn't Work
Docker Compose Labels
I tried using Docker Compose labels to group containers by stack:
labels:
  - "stack=automation"
But in my setup, cAdvisor didn't expose those custom labels as Prometheus labels. I had to manually filter by container name patterns instead, which is fragile: if I rename a container, the dashboard breaks.
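In practice that means every panel query carries a name regex, something like this (the prefix is hypothetical; it just has to match whatever naming scheme the stack uses):

container_memory_usage_bytes{name=~"automation-.*"}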
Network Metrics Per Stack
I wanted to track total network I/O per Compose stack, but cAdvisor only exposes per-container network stats. Summing them in Prometheus works, but the queries get messy when containers share networks or use host networking.
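The summing itself is straightforward, roughly like this sketch (again keyed off a hypothetical name prefix); it's the shared-network and host-network cases that make the numbers hard to trust:

sum(rate(container_network_receive_bytes_total{name=~"automation-.*"}[5m])) + sum(rate(container_network_transmit_bytes_total{name=~"automation-.*"}[5m]))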
Historical Exit Codes
My cron-based exit code logging only captures the current state. If a container exits and restarts before the script runs, I miss the exit code. I considered using Docker's event stream API, but that felt like overengineering for my use case.
Filesystem Metrics
cAdvisor exposes container_fs_usage_bytes, but it's unreliable for bind mounts. I have several containers that mount host directories, and the metrics either show zero or the entire host filesystem size. I stopped trying to make this work and just monitor host disk usage separately.
Key Takeaways
- cAdvisor is great for basic container metrics, but it has gaps. Exit codes and custom labels require workarounds.
- Mounting the Docker socket is necessary, but it's a security risk. I only do this on my internal network, never on publicly exposed hosts.
- Prometheus retention matters. I set mine to 30 days. Anything longer bloats storage, and I rarely need data older than that.
- Grafana alerts are powerful but noisy. I spent weeks tuning thresholds to avoid alert fatigue.
- Container name consistency is critical. My dashboard queries rely on predictable naming patterns. Random names break everything.
This setup isn't perfect, but it gives me the visibility I need to keep my Docker stacks running without constant manual checks. The exit code tracking is still rough, and I'll probably replace the cron script with a proper exporter eventually, but for now, it works.