
Is there a way to query Prometheus to count failed jobs in a time range?

Several metrics are collected for cron jobs; unfortunately, I'm not sure how to use them properly.

I wanted to use the kube_job_status_failed == 1 metric. I can use a regex such as job=~".+myjobname.+" to aggregate all failed attempts for a cron job.

This is where I got stuck. Is there a way to count the number of distinct labels (= number of failed attempts) in a given time period?

Or can I use the metrics the other way around, that is, check whether kube_job_status_succeeded{job=~".+myjobname.+"} == 1 occurred in a given time period?

I feel like I’m so close to solving this but I just can’t wrap my head around it.

EDIT: Added a picture. It shows that there clearly are several succeeded jobs over time; I just have no clue how to count them.

Paul7979 asked Nov 06 '25 02:11


1 Answer

Alright people, here is a somewhat gross way to do this that you can generalize for gauges that you only want to count the initial value of:

Step 1: Make it so you can count the gauge value just once (effectively counting the number of distinct labels):

sum(kube_job_failed{condition="true"} unless kube_job_failed offset 1m)

Graphed, this query shows each job failure at the moment it happens, as a spike that does not persist afterwards.

This assumes a scrape interval of 1m. If you scrape kube-state-metrics every 30s, a 1m offset will double count some failures, so use offset 30s instead. It works because unless performs a left anti-join: it drops every series in the left-hand instant vector that also existed one scrape interval earlier. Each series is therefore counted just once, the first time it is scraped.
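To make the anti-join concrete, here is a toy Python model of it. This is only a sketch, not how Prometheus evaluates queries: it assumes evaluation steps line up exactly with scrape timestamps and ignores lookback and staleness handling. The function name first_seen_counts and the series names job-a and job-b are made up for illustration.

```python
def first_seen_counts(present_at, offset):
    """Toy model of `X unless X offset <offset>`.

    present_at maps a timestamp in seconds to the set of series present then.
    Returns, per timestamp, how many series are present now but were absent
    `offset` seconds earlier.
    """
    counts = {}
    for t in sorted(present_at):
        earlier = present_at.get(t - offset, set())  # nothing known: treat as absent
        counts[t] = len(present_at[t] - earlier)     # set difference = anti-join
    return counts


# Scraped every 60s: job-a fails at t=60 and its series persists, job-b fails at t=180.
per_minute = {0: set(), 60: {"job-a"}, 120: {"job-a"}, 180: {"job-a", "job-b"}}

# Scraped every 30s: job-a fails at t=30 and its series persists.
per_30s = {0: set(), 30: {"job-a"}, 60: {"job-a"}, 90: {"job-a"}, 120: {"job-a"}}

print(sum(first_seen_counts(per_minute, 60).values()))  # 2: each failure counted once
print(sum(first_seen_counts(per_30s, 60).values()))     # 2: job-a counted twice
print(sum(first_seen_counts(per_30s, 30).values()))     # 1: offset matches the interval
```

The second result is the double counting described above: with 30s steps and a 1m offset, job-a still looks "new" at both t=30 and t=60.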

Step 2: Sum that result over the time range you care about:

sum_over_time(sum(kube_job_failed{condition="true"} unless kube_job_failed offset 1m)[1h:])

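This sums the previous query over the time range you give it, in this case the past hour. Because the resolution after the colon is left empty, the subquery is evaluated at Prometheus's global evaluation interval, so that interval should equal the 1m offset for each failure to be counted exactly once.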
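If you need this for different windows or scrape intervals, a small helper can assemble the expression. The sketch below makes a few assumptions that are not part of the answer: the function names are hypothetical, http://localhost:9090 is a placeholder address, and the subquery resolution is pinned explicitly to the scrape interval instead of being left empty. It only builds the query string and a URL for Prometheus's /api/v1/query HTTP endpoint; it does not send a request.

```python
from urllib.parse import urlencode


def failed_jobs_query(window="1h", scrape_interval="1m", metric="kube_job_failed"):
    """Build the two-step query: first-seen failures, summed over `window`."""
    # Step 1: series present now but absent one scrape interval earlier.
    inner = f'sum({metric}{{condition="true"}} unless {metric} offset {scrape_interval})'
    # Step 2: subquery over the window, with the resolution set to the scrape
    # interval (an assumption of this sketch, so resolution and offset agree).
    return f"sum_over_time({inner}[{window}:{scrape_interval}])"


def instant_query_url(base_url, query):
    """URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?{urlencode({'query': query})}"


query = failed_jobs_query()
print(query)
print(instant_query_url("http://localhost:9090", query))  # placeholder host
```

For a 30s scrape interval over the past day, failed_jobs_query(window="24h", scrape_interval="30s") changes both the offset and the resolution together, which keeps the two in step.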

Ramfjord answered Nov 09 '25 07:11