I am using kube-prometheus-stack and the yaml snippets you see below are part of a PrometheusRule definition.
This is a completely hypothetical scenario, the simplest one I could think of that illustrates my point.
Given this kind of metric:
cpu_usage{job="job-1", must_be_lower_than="50"} 33.72
cpu_usage{job="job-2", must_be_lower_than="80"} 56.89
# imagine there are plenty more lines here
# with various different values for the must_be_lower_than label
# ...
I'd like to have alerts that use the must_be_lower_than label as the threshold. Something like this (this doesn't work the way it's written, it's just to illustrate the idea):
alert: CpuUsageTooHigh
annotations:
  message: 'On job {{ $labels.job }}, the cpu usage has been above {{ $labels.must_be_lower_than }}% for 5 minutes.'
expr: cpu_usage > $must_be_lower_than
for: 5m
P.S. I already know I can define alerts like this:
alert: CpuUsageTooHigh50
annotations:
  message: 'On job {{ $labels.job }}, the cpu usage has been above 50% for 5 minutes.'
expr: cpu_usage{must_be_lower_than="50"} > 50
for: 5m
---
alert: CpuUsageTooHigh80
annotations:
  message: 'On job {{ $labels.job }}, the cpu usage has been above 80% for 5 minutes.'
expr: cpu_usage{must_be_lower_than="80"} > 80
for: 5m
This is not what I'm looking for, because I would have to manually define an alert for each of the possible values of the must_be_lower_than label.
See @markalex's comment on this post: the absent() function can be used to generate a metric with the required labels:
cpu_usage > ON(must_be_lower_than) GROUP_LEFT (absent(non_existent{must_be_lower_than="80"}) * 80 or absent(non_existent{must_be_lower_than="50"}) * 50)
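For completeness, here is a minimal sketch of an alerting rule built around that expression. The alert name is taken from your example, and the thresholds 50 and 80 are just the ones shown above; extend the or chain with every value of must_be_lower_than you actually use:
alert: CpuUsageTooHigh
annotations:
  message: 'On job {{ $labels.job }}, the cpu usage has been above {{ $labels.must_be_lower_than }}% for 5 minutes.'
expr: cpu_usage > ON(must_be_lower_than) GROUP_LEFT (absent(non_existent{must_be_lower_than="80"}) * 80 or absent(non_existent{must_be_lower_than="50"}) * 50)
for: 5m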
Old answer
There is currently no way in Prometheus to have this kind of "templating".
The only way to get something close would be to use recording rules that define one series per possible value of the label:
rules:
  - record: max_cpu_usage
    expr: vector(50)
    labels:
      must_be_lower_than: "50"
  - record: max_cpu_usage
    expr: vector(80)
    labels:
      must_be_lower_than: "80"
  # ... other possible values
Then use it in your alerting rule:
alert: CpuUsageTooHigh
annotations:
  message: 'On job {{ $labels.job }}, the cpu usage has been above {{ $labels.must_be_lower_than }}% for 5 minutes.'
expr: cpu_usage > ON(must_be_lower_than) GROUP_LEFT max_cpu_usage
for: 5m
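Since you're using kube-prometheus-stack, both the recording rules and the alert would live in a PrometheusRule resource. A minimal sketch (the metadata name, namespace and release label are assumptions and must match your Prometheus ruleSelector):
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cpu-usage-thresholds        # hypothetical name
  namespace: monitoring             # hypothetical namespace
  labels:
    release: kube-prometheus-stack  # must match your ruleSelector
spec:
  groups:
    - name: cpu-usage-thresholds.rules
      rules:
        - record: max_cpu_usage
          expr: vector(50)
          labels:
            must_be_lower_than: "50"
        - record: max_cpu_usage
          expr: vector(80)
          labels:
            must_be_lower_than: "80"
        - alert: CpuUsageTooHigh
          annotations:
            message: 'On job {{ $labels.job }}, the cpu usage has been above {{ $labels.must_be_lower_than }}% for 5 minutes.'
          expr: cpu_usage > ON(must_be_lower_than) GROUP_LEFT max_cpu_usage
          for: 5m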
Prometheus is still (and I believe always will be) against mixing labels and values. The only exception to this rule is count_values, which allows converting a metric's value into a label, but that's it: there is no mechanism to do the opposite.
Regarding your idea, I believe you're approaching it in a slightly wrong way. If you want to alert on some of your metrics based on a threshold specific to the machine those metrics come from, you should use additional metrics instead of additional labels.
I'm sorry, I'm not that familiar with kube-prometheus-stack, so I'll be using node_exporter as an example; it should be easy to adapt this to other exporters.
So for your example, you should create a textfile metric
my_metric_threshold 80
configure node_exporter's textfile collector to expose it, and then use an alert rule like this:
alert: CpuUsageTooHigh
annotations:
  message: 'On job {{ $labels.job }}, the cpu usage has been above {{ $value }}% for 5 minutes.'
expr: my_metric_threshold < on(instance) my_metric
for: 5m
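For reference, here is a sketch of what the file read by node_exporter's textfile collector could contain (the path and file name are assumptions; node_exporter picks up *.prom files from the directory passed via --collector.textfile.directory):
# /var/lib/node_exporter/textfile/thresholds.prom (hypothetical path)
# HELP my_metric_threshold Per-machine alerting threshold.
# TYPE my_metric_threshold gauge
my_metric_threshold 80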
This way your thresholds are tied to your machines, and you don't need to reload Prometheus' config when you decide to add or change a threshold.
Also, you can get more granularity with less hassle. To stay with the node_exporter example, you can use textfile metrics
cpu_must_be_lower_than{cpu="0"} 80
cpu_must_be_lower_than{cpu="1"} 50
and the expression cpu_must_be_lower_than < on(instance, cpu) (100 - 100 * rate(node_cpu_seconds_total{mode="idle"}[5m])).
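A sketch of an alert built on that expression (the alert name and message wording are mine, and the 5m rate window is an assumption):
alert: CpuCoreUsageTooHigh
annotations:
  message: 'On instance {{ $labels.instance }}, CPU {{ $labels.cpu }} has been above {{ $value }}% for 5 minutes.'
expr: cpu_must_be_lower_than < on(instance, cpu) (100 - 100 * rate(node_cpu_seconds_total{mode="idle"}[5m]))
for: 5m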