Why I Built This Dashboard
I run multiple Docker Compose stacks on my Proxmox home lab. Some handle automation workflows, others manage data pipelines, and a few just sit there doing scheduled tasks. The problem wasn't that containers crashed—it was that I didn't know they had crashed until something downstream broke.
I needed visibility into container health without constantly SSH-ing into hosts or tailing logs. I wanted to know:
- Which containers were running, stopped, or restarting
- Resource usage patterns over time
- Exit codes when things failed
- Historical trends to catch slow degradation
I already had Grafana running for other monitoring tasks, so I decided to build a dashboard specifically for Docker Compose stack health.
My Actual Setup
I run this monitoring stack on a Proxmox LXC container (Debian-based) that hosts several Docker Compose projects. The monitoring components themselves are also deployed via Docker Compose.
Here's what I used:
- cAdvisor – Exposes container-level metrics (CPU, memory, network, filesystem)
- Prometheus – Scrapes and stores metrics from cAdvisor
- Grafana – Visualizes the data
- Docker socket mount – Gives cAdvisor access to container stats
I did not use Node Exporter initially because I only cared about containers, not host-level CPU or disk stats. I added it later when I wanted to correlate container behavior with overall system load.
The Docker Compose File
This is the actual compose file I use (trimmed of unrelated services):
version: '3.8'

services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.47.0
    container_name: cadvisor
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    privileged: true
    devices:
      - /dev/kmsg
    restart: unless-stopped

  prometheus:
    image: prom/prometheus:v2.45.0
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
    restart: unless-stopped

  grafana:
    image: grafana/grafana:10.0.3
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    restart: unless-stopped

volumes:
  prometheus-data:
  grafana-data:
I pinned specific versions because I got burned once by a cAdvisor update that changed metric label names, breaking my dashboard queries.
Prometheus Configuration
The prometheus.yml file tells Prometheus where to scrape metrics:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
I kept the scrape interval at 15 seconds. I tried 5 seconds once, thinking more data would be better, but it just bloated my Prometheus storage without adding useful insights.
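Earlier I mentioned adding Node Exporter later to correlate container behavior with host load. That change was small: one extra service in the compose file and one extra scrape job, roughly along these lines (a sketch; the image tag and options are illustrative, and most setups also mount /proc and /sys read-only):

  node-exporter:
    image: prom/node-exporter:v1.6.1
    container_name: node-exporter
    ports:
      - "9100:9100"
    restart: unless-stopped

And in prometheus.yml:

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']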
What Worked
Container State Tracking
cAdvisor exposes a metric called container_last_seen, which records the last time a container was observed. I used this to detect stopped containers:
time() - container_last_seen{name=~".+"} > 60
This query finds containers that haven't been seen in the last 60 seconds. It's not perfect—if Prometheus scraping fails, it triggers false positives—but it works well enough for my setup.
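One way to reduce those false positives, sketched here rather than something I run verbatim, is to only evaluate the expression while the cAdvisor target itself is reporting healthy:

time() - container_last_seen{name=~".+"} > 60 and on() (up{job="cadvisor"} == 1)

When the cAdvisor scrape is failing, up drops to 0 and the condition simply returns nothing instead of flagging containers.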
Resource Usage Over Time
I built a panel showing memory usage as a percentage of the container's limit:
(container_memory_usage_bytes{name=~".+"} / container_spec_memory_limit_bytes{name=~".+"}) * 100
This immediately highlighted a container that was slowly leaking memory. It would restart every few days, and I never noticed the pattern until I saw the graph.
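To catch that kind of slow leak before it turns into restarts, a query along these lines (a rough sketch, not a panel I actually run) flags containers that would hit their memory limit within a day if the current trend held:

predict_linear(container_memory_usage_bytes{name=~".+"}[6h], 24 * 3600) > container_spec_memory_limit_bytes{name=~".+"}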
Exit Code Visibility
This was harder than expected. cAdvisor doesn't expose exit codes, so I had to go through the Docker Engine API instead, mounting /var/run/docker.sock and parsing each container's inspect data with a script.
I ended up writing a small Python script that runs every 5 minutes via cron:
import docker

# Connect to the local Docker daemon through the socket
client = docker.from_env()

# Include stopped containers, not just running ones
for container in client.containers.list(all=True):
    state = container.attrs['State']
    if state['Status'] == 'exited':
        exit_code = state['ExitCode']
        print(f"{container.name}: exited with code {exit_code}")
I send the output to a log file that I tail in Grafana using the Loki plugin. It's clunky, but it works. I haven't found a cleaner way to get exit codes into Prometheus without writing a custom exporter.
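The cron entry is nothing special, just the script with its output appended to a log file (the paths here are illustrative, not my exact layout):

*/5 * * * * python3 /opt/scripts/exit_codes.py >> /var/log/container-exits.log 2>&1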
Restart Count Alerts
cAdvisor tracks container_start_time_seconds, which changes to a new value every time a container restarts. I used this to detect containers that restart frequently:
changes(container_start_time_seconds{name=~".+"}[1h]) > 3
This alerts me if a container restarts more than 3 times in an hour. I set up a Grafana alert that sends a message to my self-hosted n8n webhook, which then posts to a Discord channel.
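For reference, the same condition expressed as a Prometheus alerting rule would look roughly like this; it's a sketch only, since I actually configure the alert in Grafana:

groups:
  - name: container-restarts
    rules:
      - alert: ContainerRestartingFrequently
        expr: changes(container_start_time_seconds{name=~".+"}[1h]) > 3
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.name }} restarted more than 3 times in the last hour"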
What Didn't Work
Docker Compose Labels
I tried using Docker Compose labels to group containers by stack:
labels:
  - "stack=automation"
But in my setup, cAdvisor didn't expose those custom labels as Prometheus labels. I had to manually filter by container name patterns instead, which is fragile: if I rename a container, the dashboard breaks.
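In practice that means every panel query carries a name regex, something like this (the prefix is hypothetical; it just has to match whatever naming scheme the stack uses):

container_memory_usage_bytes{name=~"automation-.*"}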
Network Metrics Per Stack
I wanted to track total network I/O per Compose stack, but cAdvisor only exposes per-container network stats. Summing them in Prometheus works, but the queries get messy when containers share networks or use host networking.
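The summing itself is straightforward, roughly like this sketch (again keyed off a hypothetical name prefix); it's the shared-network and host-network cases that make the numbers hard to trust:

sum(rate(container_network_receive_bytes_total{name=~"automation-.*"}[5m])) + sum(rate(container_network_transmit_bytes_total{name=~"automation-.*"}[5m]))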
Historical Exit Codes
My cron-based exit code logging only captures the current state. If a container exits and restarts before the script runs, I miss the exit code. I considered using Docker's event stream API, but that felt like overengineering for my use case.
Filesystem Metrics
cAdvisor exposes container_fs_usage_bytes, but it's unreliable for bind mounts. I have several containers that mount host directories, and the metrics either show zero or the entire host filesystem size. I stopped trying to make this work and just monitor host disk usage separately.
Key Takeaways
- cAdvisor is great for basic container metrics, but it has gaps. Exit codes and custom labels require workarounds.
- Mounting the Docker socket is necessary, but it's a security risk. I only do this on my internal network, never on publicly exposed hosts.
- Prometheus retention matters. I set mine to 30 days. Anything longer bloats storage, and I rarely need data older than that.
- Grafana alerts are powerful but noisy. I spent weeks tuning thresholds to avoid alert fatigue.
- Container name consistency is critical. My dashboard queries rely on predictable naming patterns. Random names break everything.
This setup isn't perfect, but it gives me the visibility I need to keep my Docker stacks running without constant manual checks. The exit code tracking is still rough, and I'll probably replace the cron script with a proper exporter eventually, but for now, it works.