Paul's Programming Notes

A dead man's switch for a single-host monitoring stack

I run Prometheus, Alertmanager, and Grafana on a single mini PC, which has an obvious blind spot: if that box goes down, nothing can alert me, because the thing that sends alerts is the thing that’s down. Prometheus once sat dead for two days before I noticed, and only because a dashboard had gone blank.

The fix is a dead man’s switch: an alert that fires constantly, routed to an outside service that complains when it stops arriving. Every other alert fires when something breaks; this one fires all the time, and silence is the failure signal.

The rule is just vector(1), which always returns a sample, so the alert never stops firing:

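# alerts/meta.yml
groups:
  - name: meta   # group name is a placeholder; any name works
    rules:
      - alert: Watchdog
        expr: vector(1)   # always returns one sample, so the alert is always firing
        labels:
          severity: none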

kube-prometheus-stack ships the same alert under the same name. Then route that one alert away from Telegram and into a webhook:

routes:
  - matchers:
      - alertname = "Watchdog"
    receiver: deadmansswitch
    group_wait: 0s
    repeat_interval: 5m   # re-ping every 5 minutes

receivers:
  - name: deadmansswitch
    webhook_configs:
      - url: https://hc-ping.com/<uuid>
        send_resolved: false

The webhook target is a healthchecks.io check. Alertmanager pings it every five minutes while the pipeline is healthy; I set the check to a ~10 minute period with a ~5 minute grace, so a real outage pages me within about 15 minutes, from a service that isn’t on my network.
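
Before trusting the route, it can help to hit the check by hand with the same kind of request a webhook receiver gets: an HTTP POST with a JSON body. This is a minimal sketch, assuming curl is installed. The function name send_test_ping and the stub payload are my own placeholders, not Alertmanager's actual notification format.

```shell
# Hypothetical helper: POST a stub JSON body to a ping URL, roughly the
# way Alertmanager calls a webhook receiver. The payload is a stand-in,
# not the real notification format.
send_test_ping() {
    url=$1
    # -f: fail on HTTP errors, -sS: quiet but still print errors,
    # -m 10: give up after 10 seconds, -d: send a body (implies POST)
    curl -fsS -m 10 \
        -H 'Content-Type: application/json' \
        -d '{"status":"firing"}' \
        -o /dev/null "$url"
}
```

Calling send_test_ping 'https://hc-ping.com/<uuid>' with the real ping URL should flip the check to "up" if the URL is right. A non-zero exit status means the request failed.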

Two details that bit me: the ping URL lives in my git-ignored alertmanager.yml (low-sensitivity, like the Telegram chat ID), and send_resolved has to be set to false explicitly, because webhook receivers send “resolved” notifications by default and that extra call would muddy the heartbeat.

The same trick works for any scheduled job: have the script curl a healthchecks.io ping URL on success, and let the absence page you.
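
A minimal sketch of that wrapper, assuming a POSIX shell and curl. The function name run_and_ping and the PING_URL variable are hypothetical, and the URL is the same placeholder as above. The job runs first, and the ping goes out only if it exits 0, so a failed or missing run shows up as silence.

```shell
# Placeholder URL: substitute the ping URL of your own check.
PING_URL="${PING_URL:-https://hc-ping.com/<uuid>}"

# Hypothetical wrapper: run the command given as arguments and ping the
# check only when it succeeds.
run_and_ping() {
    # Job failed: skip the ping and pass its exit status through.
    "$@" || return $?
    # Job succeeded: send the heartbeat. --retry covers brief network blips.
    curl -fsS -m 10 --retry 3 -o /dev/null "$PING_URL"
}
```

A cron script could source this file and call run_and_ping /path/to/backup.sh (the path is illustrative). The check's period and grace then need to match the job's schedule, the same way the 10 minute period matches the 5 minute re-ping above.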