Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I find missing metrics in Prometheus that have a certain label?

Tags:

prometheus

I have the following expression for an alert in Prometheus:

absent_over_time(node_filesystem_size_bytes{device=~"/dev/nvme.*"}[5m]) > 0

As far as I can tell, this will be > 0 if there are no metrics for /dev/nvme* in the last 5 minutes.

However, I would like this to take into account the instance that it runs on. That is, I want this to be triggered if any instance (preferably with a label saying which one(s)) is missing this metric over the last 5 minutes. I'm assuming that if one node is successfully scraped, this condition won't be true anymore.

There is an instance label on the node_filesystem_size_bytes metric, but if that metric is missing, I'm not sure how it could detect that.

Do I need to somehow get a list of instances at the current time and join it with node_filesystem_size_bytes to accomplish this? Or is there some other way?

This is using node-exporter and the kube-prometheus-stack in a Kubernetes cluster.

like image 956
Andrew DiNunzio Avatar asked Sep 01 '25 05:09

Andrew DiNunzio


1 Answers

Try the following query:

(node_filesystem_size_bytes{device=~"/dev/nvme.*"} offset 5m)
  unless
node_filesystem_size_bytes{device=~"/dev/nvme.*"}

It should return time series matching the node_filesystem_size_bytes{device=~"/dev/nvme.*"} with all their labels, which had raw samples 5 minutes ago, but have no new samples aftewards.

Note that this query can be simplified to the following one when using VictoriaMetrics - Prometheus-like system I work on:

lag(node_filesystem_size_bytes{device=~"/dev/nvme.*"}[10m]) > 5m

It uses the lag() function, which returns the duration since the last raw sample per every input time series.

like image 53
valyala Avatar answered Sep 05 '25 03:09

valyala