
Setting Up Alertmanager to Deduplicate Noisy Prometheus Alerts from Flapping Home Assistant Sensors

Why I Needed This

I run Home Assistant on my network to track everything from temperature sensors to door contacts. Most of the time it works fine, but some sensors are flaky. A basement humidity sensor loses WiFi for a few seconds every hour. A motion detector occasionally drops offline when my mesh network decides to rebalance. Every time this happens, Prometheus fires an alert.

The problem isn’t that the sensors fail—they recover on their own. The problem is my phone lighting up with the same alert five times in an hour because the sensor keeps flapping between available and unavailable states. I needed a way to acknowledge that yes, this sensor is unstable, but no, I don’t need to be woken up about it every time.

That’s where Alertmanager’s deduplication and grouping features came in.

My Setup Before This

I already had Prometheus scraping Home Assistant’s metrics endpoint. Every entity in Home Assistant exposes its state as a metric, so I could write alert rules for things like:

  • Sensor unavailable for more than 5 minutes
  • Battery level below 20%
  • Temperature out of range
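The battery and temperature rules looked roughly like this. The metric names and thresholds here are illustrative; what your Home Assistant exporter actually exposes may differ:

```yaml
# Illustrative rules; exact metric names depend on the exporter.
groups:
  - name: homeassistant-sensors
    rules:
      - alert: SensorBatteryLow
        expr: homeassistant_sensor_battery_percent < 20
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Low battery on {{ $labels.entity_id }} ({{ $value }}%)"
      - alert: TemperatureOutOfRange
        expr: homeassistant_sensor_temperature_celsius < 5 or homeassistant_sensor_temperature_celsius > 35
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Temperature out of range on {{ $labels.entity_id }}"
```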

The alerts worked, but they were noisy. A single flapping sensor would trigger the same alert repeatedly because Prometheus evaluates rules on every scrape cycle. Even with a 1-minute evaluation interval, that’s a lot of duplicate notifications.

I was using Alertmanager already, but only for routing alerts to different channels. I hadn’t configured any grouping or inhibition rules, so every alert fired independently.

What I Changed

Grouping Alerts by Sensor

The first thing I did was group alerts by the entity_id label. This meant that if the same sensor triggered multiple alerts within a time window, Alertmanager would bundle them into a single notification instead of sending each one separately.

In my alertmanager.yml, I added a route specifically for Home Assistant alerts:

route:
  receiver: 'default'
  group_by: ['alertname', 'entity_id']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    - match:
        job: 'homeassistant'
      receiver: 'homeassistant-notifications'
      group_by: ['entity_id']
      group_wait: 2m
      group_interval: 10m

The group_wait of 2 minutes means Alertmanager waits that long after the first alert fires before sending a notification. If the sensor recovers in that window, I don’t get notified at all. If it flaps again within the next 10 minutes (group_interval), those alerts get bundled into the same notification.

Adding Context to Alerts

Grouped notifications are only useful if I can tell what’s actually happening. I updated my Prometheus alert rules to include more context in the labels and annotations:

- alert: SensorUnavailable
  expr: homeassistant_sensor_state{entity_id=~"sensor.*"} == 0
  for: 5m
  labels:
    severity: warning
    entity_id: "{{ $labels.entity_id }}"
    friendly_name: "{{ $labels.friendly_name }}"
  annotations:
    summary: "Sensor {{ $labels.friendly_name }} unavailable"
    description: "{{ $labels.entity_id }} has been unavailable for 5 minutes"

This way, when Alertmanager groups alerts, the notification shows which sensors are affected and how long they’ve been down.
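On the notification side, the grouped alerts can be rendered as one list. I send mine to a phone app, but the same idea expressed with a generic Slack receiver looks like this (the webhook URL is a placeholder; the title and text fields use standard Alertmanager Go templating over the grouped alerts):

```yaml
# In alertmanager.yml: a receiver whose message body lists every
# alert in the group instead of sending each one separately.
receivers:
  - name: 'homeassistant-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXXX'  # placeholder webhook
        title: '{{ .CommonLabels.alertname }}: {{ len .Alerts }} sensor(s)'
        text: >-
          {{ range .Alerts }}
          {{ .Labels.friendly_name }} ({{ .Labels.entity_id }}): {{ .Annotations.description }}
          {{ end }}
```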

Inhibition for Known Flaky Sensors

Some sensors flap constantly and I’ve accepted that. For those, I created an inhibition rule so that if a more critical alert fires (like the entire Home Assistant instance going down), the flaky sensor alerts get suppressed.

inhibit_rules:
  - source_match:
      alertname: 'HomeAssistantDown'
    target_match_re:
      alertname: 'SensorUnavailable|SensorBatteryLow'
    equal: ['job']

This prevents me from getting flooded with sensor alerts when the real problem is that Home Assistant itself is offline.
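The inhibition above only works if a HomeAssistantDown alert actually exists as its own rule. A minimal version is the standard scrape-health check; the 2-minute threshold here is my choice:

```yaml
# Fires when Prometheus fails to scrape Home Assistant entirely.
# The job label comes from the 'homeassistant' scrape job, which is
# what the inhibition rule's equal: ['job'] clause keys on.
- alert: HomeAssistantDown
  expr: up{job="homeassistant"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Home Assistant instance is unreachable"
```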

What Worked

The grouping alone cut my notification volume by about 70%. Instead of getting pinged every time a sensor blinked offline, I’d get one notification saying “these three sensors have been flapping for the last 10 minutes.” If they recovered quickly, I wouldn’t hear about it at all.

The group_wait setting was particularly useful. Most sensor drops resolve themselves in under a minute, so waiting 2 minutes before notifying meant I only heard about actual problems.

Inhibition helped during network issues. When my router rebooted, I used to get alerts for every single sensor going offline. Now I just get one alert for the network itself, and the sensor alerts stay quiet.

What Didn’t Work

I initially set group_interval too low—1 minute. This meant that even with grouping enabled, I was still getting notifications too frequently. Bumping it to 10 minutes felt like the right balance between being informed and not being spammed.

I also tried using repeat_interval to stop getting reminders about ongoing issues, but that backfired. Setting it to 24 hours meant I forgot about sensors that stayed offline for days. I settled on 4 hours as a reminder that something still needs attention without being annoying.

One thing I couldn’t solve cleanly: differentiating between “sensor is flapping” and “sensor is actually dead.” A sensor that drops offline every five minutes looks the same in Prometheus as one that’s been offline for an hour. I ended up creating separate alert rules with different “for” durations (5 minutes for flapping, 1 hour for actually dead), but it’s not perfect.
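Concretely, that workaround is two rules over the same expression with different “for” durations. The SensorDead name and its severity are my choices:

```yaml
# Same condition, two time thresholds: a short one for flapping
# sensors, a long one for sensors that never come back.
- alert: SensorUnavailable
  expr: homeassistant_sensor_state{entity_id=~"sensor.*"} == 0
  for: 5m
  labels:
    severity: warning
- alert: SensorDead
  expr: homeassistant_sensor_state{entity_id=~"sensor.*"} == 0
  for: 1h
  labels:
    severity: critical
```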

Key Takeaways

Alertmanager’s grouping is designed for exactly this kind of problem. If you have metrics that flap or fire repeatedly, grouping by a unique identifier (like entity_id) keeps your notification channels sane.

The timing settings matter more than I expected. group_wait and group_interval need to match the behavior of your sensors. Too short and you still get noise. Too long and you miss real issues.

Inhibition rules are powerful but require careful thought. It’s easy to accidentally suppress alerts you actually care about. I keep my inhibition rules narrow and only use them for obvious cases like “don’t alert on sensors when the whole system is down.”

This setup isn’t perfect, but it made my alerting usable. I still get notified about real problems, but I’m not constantly dismissing alerts for sensors that just need a WiFi reconnect.
