Is there a way to query Prometheus to count failed jobs in a time range?

Question

There are several metrics collected for cron jobs, but unfortunately I'm not sure how to use them properly.

I wanted to use the kube_job_status_failed == 1 metric. I can use a regex such as job=~".+myjobname.+" to aggregate all failed attempts for a cron job.

This is where I got stuck. Is there a way to count the number of distinct labels (= number of failed attempts) in a given time period?

Or can I use the metrics the other way around, meaning checking whether there was a kube_job_status_succeeded{job=~".+myjobname.+"} == 1 in a given time period?

I feel like I’m so close to solving this but I just can’t wrap my head around it.

EDIT: Added a picture. This shows that there clearly are several succeeded jobs over time; I just have no clue how to count them.

Ramfjord · Accepted Answer

Alright people, here is a somewhat gross way to do this that you can generalize for gauges that you only want to count the initial value of:

Step 1: Make it so you can count the gauge value just once (effectively counting the number of distinct label sets):

sum(kube_job_failed{condition="true"} unless kube_job_failed offset 1m)

What you will see with this query is a graph of job failures at the moment they happen; the value doesn't persist on later scrapes.

This is assuming a scrape interval of 1m. If you scrape kube-state-metrics every 30s, this will double count some failures, and you should use an offset of 30s instead. The way this works is that we're doing a left anti-join with unless to remove every series in the current instant vector that already existed one scrape interval earlier. That lets you count each series just once, the first time it is scraped.
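To make the anti-join concrete, here is a toy Python model of what the step 1 query does. It is only a sketch: each scrape is reduced to a set of job names, the job names and the helper first_seen_counts are made up for illustration, and a step with no new failures is shown as 0 even though PromQL's sum over an empty vector returns no sample at all.

```python
# Toy model: each scrape is the set of job names whose "failed" gauge is present.
# Assumes one sample per minute, matching the 1m offset in the query.
scrapes = [
    set(),                    # minute 0: nothing has failed
    {"myjob-1"},              # minute 1: myjob-1 fails
    {"myjob-1"},              # minute 2: the gauge is still present
    {"myjob-1", "myjob-2"},   # minute 3: myjob-2 fails as well
    {"myjob-1", "myjob-2"},   # minute 4: both gauges still present
]


def first_seen_counts(scrapes, offset=1):
    """Mimic sum(metric unless metric offset <offset steps>) at every step."""
    counts = []
    for i, current in enumerate(scrapes):
        # The "offset" side: what the same metric looked like `offset` steps ago.
        earlier = scrapes[i - offset] if i >= offset else set()
        # "unless" is an anti-join: keep only series that were absent earlier.
        counts.append(len(current - earlier))
    return counts


new_failures = first_seen_counts(scrapes)
print(new_failures)  # [0, 1, 0, 1, 0]
```

Each failure shows up exactly once, at the step where it first appears. If the offset spans two scrapes (the 30s scrape interval with a 1m offset case), the same job is still "new" on its second scrape, which is where the double counting comes from.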

Step 2:

sum_over_time(sum(kube_job_failed{condition="true"} unless kube_job_failed offset 1m)[1h:])

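This is going to sum the previous query over the time range you give it, in this case the past 1 hour. Note that [1h:] is a subquery with no explicit resolution, so it is evaluated at the global evaluation interval; that step should match the offset used in step 1, otherwise failures can be missed or counted twice.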
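As a rough illustration of the summing step, the sketch below adds up per-step "new failure" counts over a trailing window. The input values and the helper sum_over_window are hypothetical; they stand in for the per-minute output of the step 1 query, not for real Prometheus data.

```python
# Hypothetical per-minute output of the step 1 query (new failures at each step).
new_failures_per_step = [0, 1, 0, 1, 0, 0, 2, 0]


def sum_over_window(values, window_steps):
    """Mimic sum_over_time(<step 1 query>[window:]) evaluated at the latest step."""
    # Only the most recent `window_steps` evaluations fall inside the window.
    return sum(values[-window_steps:])


total_last_5 = sum_over_window(new_failures_per_step, 5)
total_all = sum_over_window(new_failures_per_step, len(new_failures_per_step))
print(total_last_5, total_all)  # 3 4
```

Because every failure contributes 1 only at the step where it first appears, the sum over the window is the number of failures that started inside it.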
