Setting Up Prometheus Alerts for GPU Temperature Spikes in Self-Hosted LLM Containers Running on Consumer Hardware

Why I Set This Up

I run local LLM inference on consumer GPUs in Docker containers on my Proxmox host. These aren't datacenter cards—they're gaming GPUs repurposed for AI work. The problem I kept hitting was thermal throttling during long inference runs. The card would spike to 85°C+, performance would drop, and I wouldn't know until the model started responding slowly or the container logs showed thermal warnings.

I needed automated alerts before things got critical. Not enterprise monitoring with a full observability stack—just reliable temperature tracking that would notify me when a GPU was running too hot.

My Setup

Here's what I'm working with:

  • Proxmox host running Docker containers for LLM inference (Ollama, text-generation-webui)
  • NVIDIA RTX 3090 passed through to containers via nvidia-container-toolkit
  • Prometheus already running in a container for other host metrics
  • No Kubernetes—this is plain Docker on a single node

The goal was simple: scrape GPU temperature metrics and send alerts when thresholds are crossed.

DCGM Exporter for GPU Metrics

NVIDIA's DCGM (Data Center GPU Manager) has a companion exporter, dcgm-exporter, that exposes GPU metrics in Prometheus format. Even though my setup isn't a datacenter, it works fine on consumer cards.

I deployed it as a Docker container alongside my LLM containers:

docker run -d \
  --name dcgm-exporter \
  --gpus all \
  --restart unless-stopped \
  -p 9400:9400 \
  nvcr.io/nvidia/k8s/dcgm-exporter:3.1.8-3.1.5-ubuntu22.04

Key points from my experience:

  • The --gpus all flag is required so the container can access the GPU
  • Port 9400 is the default metrics endpoint
  • The image version matters—older versions had issues with RTX 30-series cards
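
If you want a quick sanity check that GPU passthrough itself works before adding the exporter, a throwaway CUDA container does it (any recent nvidia/cuda base image tag works; the one below is just an example):

docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi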

Once running, I verified metrics were available:

curl http://localhost:9400/metrics | grep temperature

This returned several temperature metrics. The one I needed was DCGM_FI_DEV_GPU_TEMP, which reports the current GPU temperature in Celsius.
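
For reference, the line of interest looks roughly like this (the exact label set depends on the exporter version; the trailing number is the current reading in °C):

DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-…",modelName="NVIDIA GeForce RTX 3090",Hostname="…"} 64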

Configuring Prometheus Scraping

I added the DCGM exporter as a scrape target in my Prometheus config. Since I run Prometheus in Docker too, I edited the prometheus.yml mounted into the container:

scrape_configs:
  - job_name: 'gpu-metrics'
    scrape_interval: 5s
    static_configs:
      - targets: ['host.docker.internal:9400']
        labels:
          instance: 'proxmox-gpu-node'

Notes on this config:

  • scrape_interval: 5s checks temperature every 5 seconds. I tried 1s initially but it was overkill and added unnecessary load
  • host.docker.internal is how containers reach services running on the host. On Linux, I had to add --add-host=host.docker.internal:host-gateway to the Prometheus container's run command (see the full command after this list)
  • The instance label helps when I eventually add more nodes
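
For completeness, here's roughly how the Prometheus container itself is launched with that host-gateway mapping (the container name and host path are just my conventions):

docker run -d \
  --name prometheus \
  --restart unless-stopped \
  --add-host=host.docker.internal:host-gateway \
  -p 9090:9090 \
  -v /opt/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus:latest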

After restarting Prometheus, I confirmed the target was up in the Prometheus UI at http://localhost:9090/targets.
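
The same check works from the command line against the Prometheus HTTP API, if you'd rather not open the UI (jq is optional, it just makes the JSON readable):

curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'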

Creating Temperature Alert Rules

Prometheus alert rules are defined in a separate file. I created gpu_alerts.yml and mounted it into the Prometheus container:

groups:
  - name: gpu_temperature
    interval: 10s
    rules:
      - alert: GPUTemperatureHigh
        expr: DCGM_FI_DEV_GPU_TEMP > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "GPU temperature high on {{ $labels.instance }}"
          description: "GPU temperature is {{ $value }}°C (threshold: 80°C)"

      - alert: GPUTemperatureCritical
        expr: DCGM_FI_DEV_GPU_TEMP > 85
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "GPU temperature critical on {{ $labels.instance }}"
          description: "GPU temperature is {{ $value }}°C (threshold: 85°C)"

Why these thresholds:

  • 80°C warning: My RTX 3090 starts thermal throttling around 82-83°C. I wanted advance notice
  • 85°C critical: At this point, performance is degraded and the card is close to its thermal limit
  • for: 2m on the warning prevents alerts during brief spikes (like model loading)
  • for: 30s on critical because if it hits 85°C, I want to know immediately
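
If you're unsure where to set thresholds for your own card, it helps to look at what it actually peaks at under a real inference load; a query like this against the Prometheus API shows the hottest reading over the last hour (adjust the window to taste):

curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=max_over_time(DCGM_FI_DEV_GPU_TEMP[1h])'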

I referenced this file in the main Prometheus config:

rule_files:
  - '/etc/prometheus/gpu_alerts.yml'
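
Before restarting Prometheus, it's worth linting the rule file. promtool ships inside the Prometheus image, so assuming the container is named prometheus, something like this catches YAML and PromQL mistakes:

docker exec prometheus promtool check rules /etc/prometheus/gpu_alerts.yml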

Setting Up Alertmanager

Prometheus detects alert conditions, but Alertmanager handles notifications. I run Alertmanager in another Docker container:

docker run -d \
  --name alertmanager \
  --restart unless-stopped \
  -p 9093:9093 \
  -v /opt/alertmanager/config.yml:/etc/alertmanager/config.yml \
  prom/alertmanager:latest \
  --config.file=/etc/alertmanager/config.yml

The explicit --config.file flag makes sure Alertmanager reads the mounted file instead of the default config bundled in the image.

My Alertmanager config sends alerts to a Discord webhook (I tried email first, but relay setup was a pain):

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'instance']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'discord'

receivers:
  - name: 'discord'
    webhook_configs:
      - url: 'http://host.docker.internal:9094/webhook'

Discord webhooks don't accept Alertmanager's webhook payload directly, so I run a small bridge service (prometheus-discord-bridge) that translates the alerts and forwards them to Discord. It's a third container, but it works reliably.
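
The bridge runs as one more container. The image name and environment variable below are placeholders (which ones you need depends on the bridge you pick); the parts that matter are the Discord webhook URL and listening on port 9094 to match the Alertmanager config above:

# image and env var are placeholders for whatever your bridge expects
docker run -d \
  --name prometheus-discord-bridge \
  --restart unless-stopped \
  -p 9094:9094 \
  -e DISCORD_WEBHOOK_URL='https://discord.com/api/webhooks/…' \
  <bridge-image>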

In my Prometheus config, I pointed to Alertmanager:

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['host.docker.internal:9093']
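
Config changes only take effect after a reload. Restarting the container is the simplest way; the hot-reload endpoint also works if Prometheus was started with --web.enable-lifecycle (container name assumed to be prometheus, as above):

docker restart prometheus
# or, with --web.enable-lifecycle enabled:
curl -X POST http://localhost:9090/-/reload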

What Worked

After running this setup for several weeks:

  • Alerts fire consistently when temperature crosses thresholds
  • The 2-minute delay on warnings eliminated false positives from brief load spikes
  • I caught two cases where my LLM container's cooling wasn't adequate—the alerts let me adjust fan curves before thermal throttling became a pattern
  • DCGM exporter has been stable with no restarts needed

The Discord notifications work well for my use case. I get a ping on my phone, can check Grafana if needed, and decide whether to intervene.

What Didn't Work

Initial attempts had issues:

Wrong DCGM version: I first tried version 2.x of the exporter. It wouldn't recognize my RTX 3090 correctly and reported zero for most metrics. Upgrading to 3.x fixed this.

Scrape interval too aggressive: I started with 1-second scraping. Prometheus CPU usage jumped noticeably, and I was generating far more data than I needed. 5 seconds is plenty for temperature monitoring.

Alert fatigue from no grouping: My first Alertmanager config didn't group alerts. During a sustained high-temp period, I got spammed with repeat notifications. Adding group_by and repeat_interval fixed this.

Host networking confusion: Getting containers to talk to each other and the host took trial and error. host.docker.internal works on Docker Desktop automatically, but on plain Docker Engine (which I use), I had to explicitly add the host-gateway mapping.

Limitations and Trade-offs

This setup has clear boundaries:

  • It only monitors temperature. Other GPU metrics (power draw, memory usage, utilization) are available from DCGM but I'm not alerting on them yet
  • No automatic remediation. The alert tells me there's a problem, but I have to manually check container logs, adjust cooling, or stop inference jobs
  • Single point of failure: if the DCGM exporter container dies, I lose all GPU visibility. I should add a meta-alert for missing metrics (a sketch of that rule follows this list)
  • Consumer GPU support in DCGM is unofficial. NVIDIA documents this for datacenter cards, and while it works on RTX, there's no guarantee future driver updates won't break something
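
The missing-metrics meta-alert I have in mind would be one more rule in gpu_alerts.yml alongside the two above; it isn't deployed yet, and the 5-minute window is a guess, but absent() is the standard PromQL way to catch a series that stops being scraped:

      - alert: GPUMetricsMissing
        expr: absent(DCGM_FI_DEV_GPU_TEMP)
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "No GPU temperature metrics received"
          description: "DCGM_FI_DEV_GPU_TEMP has been absent for 5 minutes; the dcgm-exporter container may be down"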

Key Takeaways

From actually running this system:

  • DCGM exporter works fine on consumer GPUs despite being designed for datacenter use
  • Temperature monitoring is essential for sustained LLM inference on hardware not designed for 24/7 compute loads
  • Alert thresholds need tuning based on your specific GPU and workload—80°C might be too conservative for some cards, too late for others
  • The for duration in alert rules is critical to avoid noise from transient spikes
  • Prometheus + Alertmanager is overkill if you only need GPU monitoring, but if you're already running it for other metrics, adding GPU alerts is straightforward

This setup gives me confidence to leave inference jobs running unattended. I know I'll get notified before thermal issues cause performance degradation or hardware stress.