Is there a way to monitor kube cronjob.
I have a kube cronjob which runs every 10mins on my cluster.. is there a way to collect metrics everytime my cronjob fails due to some error or notify when my cronjob has not been completed after a certain period of time.
System cron monitoring Connect these VMs to the Prometheus Pod. On VMs, configure node exporters to send system cron status metrics to Prometheus. In Prometheus, use these metrics to set up trigger rules for alerts. In a Grafana Dashboard, display the latest status of each OpenShift CronJob and each system cron.
A CronJob creates Jobs on a repeating schedule. One CronJob object is like one line of a crontab (cron table) file. It runs a job periodically on a given schedule, written in Cron format.
I'm using these rules with kube-state-metrics:
groups:
- name: job.rules
  rules:
  - alert: CronJobRunning
    expr: time() -kube_cronjob_next_schedule_time > 3600
    for: 1h
    labels:
      severity: warning
    annotations:
      description: CronJob {{$labels.namespaces}}/{{$labels.cronjob}} is taking more than 1h to complete
      summary: CronJob didn't finish after 1h
  - alert: JobCompletion
    expr: kube_job_spec_completions - kube_job_status_succeeded  > 0
    for: 1h
    labels:
      severity: warning
    annotations:
      description: Job completion is taking more than 1h to complete
        cronjob {{$labels.namespaces}}/{{$labels.job}}
      summary: Job {{$labels.job}} didn't finish to complete after 1h
  - alert: JobFailed
    expr: kube_job_status_failed  > 0
    for: 1h
    labels:
      severity: warning
    annotations:
      description: Job {{$labels.namespaces}}/{{$labels.job}} failed to complete
      summary: Job failed
The tricky part here is the cronjobs themselves have no useful status, you have to match them to the jobs they create. I've written up an article on how to achieve this:
https://medium.com/@tristan_96324/prometheus-k8s-cronjob-alerts-94bee7b90511
The article goes into a bit of detail as to how things work, but the alert config is as follow:
groups:
- name: kube-cron
  rules:
  - record: job_cronjob:kube_job_status_start_time:max
    expr: |
      label_replace(
        label_replace(
          max(
            kube_job_status_start_time
            * ON(exported_job) GROUP_RIGHT()
            kube_job_labels{label_cronjob!=""}
          ) BY (exported_job, label_cronjob)
          == ON(label_cronjob) GROUP_LEFT()
          max(
            kube_job_status_start_time
            * ON(exported_job) GROUP_RIGHT()
            kube_job_labels{label_cronjob!=""}
          ) BY (label_cronjob),
          "job", "$1", "exported_job", "(.+)"),
        "cronjob", "$1", "label_cronjob", "(.+)")
  - record: job_cronjob:kube_job_status_failed:sum
    expr: |
  clamp_max(
        job_cronjob:kube_job_status_start_time:max,
      1)
      * ON(job) GROUP_LEFT()
      label_replace(
        label_replace(
          (kube_job_status_failed != 0),
          "job", "$1", "exported_job", "(.+)"),
        "cronjob", "$1", "label_cronjob", "(.+)")
  - alert: CronJobStatusFailed
    expr: |
      job_cronjob:kube_job_status_failed:sum
      * ON(cronjob) GROUP_RIGHT()
      kube_cronjob_labels
      > 0
    for: 1m
    annotations:
      description: '{{ $labels.cronjob }} last run has failed {{$value }} times.'
The jobTemplate must include a label called cronjob that matches the name of the cronjob object.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With