A dead man's switch for a single-host monitoring stack
I run Prometheus, Alertmanager, and Grafana on a single mini PC, which has an obvious blind spot: if that box goes down, nothing can alert me, because the thing that sends alerts is the thing that’s down. Prometheus once sat dead for two days before I noticed, and only because a dashboard had gone blank.
The fix is a dead man’s switch: an alert that fires constantly, routed to an outside service that complains when it stops arriving. Every other alert fires when something breaks; this one fires all the time, and silence is the failure signal.
The rule is just vector(1), which always returns a value, so the alert never stops firing:
# alerts/meta.yml
- alert: Watchdog
  expr: vector(1)
  labels:
    severity: none
kube-prometheus-stack ships the same alert under the same name. Then route that one alert away from Telegram and into a webhook:
routes: # child routes, nested under the top-level route: block
  - matchers:
      - alertname = "Watchdog"
    receiver: deadmansswitch
    group_wait: 0s
    repeat_interval: 5m # re-ping every 5 minutes
receivers:
  - name: deadmansswitch
    webhook_configs:
      - url: https://hc-ping.com/<uuid>
        send_resolved: false
The webhook target is a healthchecks.io check. Alertmanager pings it every five minutes while the pipeline is healthy; I set the check to a ~10 minute period with a ~5 minute grace, so a real outage pages me within about 15 minutes, from a service that isn’t on my network.
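The arithmetic behind that 15-minute figure can be sketched in a few lines. This is only a model, and it rests on two assumptions: that healthchecks.io marks the check down once period plus grace has passed since the last ping it received, and that pings arrive exactly every repeat_interval. The function name is made up for illustration.

```python
from datetime import timedelta


def detection_window(ping_every: timedelta, period: timedelta, grace: timedelta):
    """Return (best, worst) delay between the outage starting and the page.

    Assumes the check goes down once `period + grace` has passed since the
    last ping received, and that pings arrive exactly every `ping_every`.
    """
    deadline = period + grace      # measured from the last successful ping
    worst = deadline               # the box dies right after sending a ping
    best = deadline - ping_every   # the box dies right before the next ping
    return best, worst


best, worst = detection_window(
    ping_every=timedelta(minutes=5),  # Alertmanager repeat_interval
    period=timedelta(minutes=10),     # healthchecks.io period
    grace=timedelta(minutes=5),       # healthchecks.io grace
)
print(f"paged between {best} and {worst} after the outage starts")
```

With the numbers from this setup the window is 10 to 15 minutes. A period of twice the ping interval also means a single delayed or dropped ping does not trip the check on its own.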
Two details that bit me: the ping URL lives in my git-ignored alertmanager.yml (it is low-sensitivity, like the Telegram chat ID), and send_resolved is set to false so that the “resolved” call, which would hit the same URL, doesn’t get counted as a heartbeat.
The same trick works for any scheduled job: have the script curl a healthchecks.io ping URL on success, and let the absence page you.
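For a scheduled job, the pattern might look like the sketch below. It uses Python's urllib instead of curl, and the function names and the default timeout are illustrative choices, not anything healthchecks.io prescribes. The ping URL is the same placeholder as above and has to be replaced with the URL of a real check.

```python
import subprocess
import sys
import urllib.request

# Placeholder: substitute the ping URL of your own check.
PING_URL = "https://hc-ping.com/<uuid>"


def ping(url: str = PING_URL, timeout: float = 10.0) -> None:
    """Send one heartbeat; network errors are logged, never raised."""
    try:
        urllib.request.urlopen(url, timeout=timeout).close()
    except OSError as exc:  # URLError, HTTPError and timeouts are all OSError
        print(f"heartbeat ping failed: {exc}", file=sys.stderr)


def run_with_heartbeat(command: list[str], send_ping=ping) -> int:
    """Run `command` and ping only if it exits with status 0.

    Returns the command's exit code, so the caller can pass it on.
    """
    result = subprocess.run(command)
    if result.returncode == 0:
        send_ping()
    return result.returncode
```

A cron entry could then call a small script that ends with sys.exit(run_with_heartbeat(sys.argv[1:])). A failed job sends nothing, so it looks the same to the check as a job that never ran, which is the point.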