Slurm: how many times will failed jobs be --requeue'd

I have a Slurm job array for which the job file includes a --requeue directive. Here is the full job file:

#!/bin/bash
#SBATCH --job-name=catsss
#SBATCH --output=logs/cats.log
#SBATCH --array=1-10000
#SBATCH --requeue
#SBATCH --partition=scavenge
#SBATCH --mem=32g
#SBATCH --time=24:00:00
#SBATCH --mail-type=FAIL
#SBATCH [email protected]
module load Langs/Python/3.4.3
python3 cats.py ${SLURM_ARRAY_TASK_ID} 'cats'

Several of the array tasks have restarted at least once. How many times will these jobs restart before the scheduler finally cancels them? Will the restarts carry on indefinitely until a sysadmin manually cancels them, or is there a maximum number of retries for jobs like this?
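
For reference, the current restart count of an individual array task can be read from the Restarts field that scontrol reports (the job ID 123456 and task index 17 below are placeholders):

scontrol show job 123456_17 | grep -o 'Restarts=[0-9]*'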

asked Nov 24 '25 by duhaime

1 Answer

AFAIK, jobs can be requeued an unlimited number of times. You only decide whether the job is allowed to be requeued or not. With --no-requeue it will never be requeued; with --requeue it will be requeued every time the system decides it is necessary (node failure, preemption by a higher-priority job, and so on).
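
To see what your particular cluster is configured to do, you can inspect the relevant slurm.conf parameters with scontrol (the field names below are standard Slurm settings; the values vary per site):

scontrol show config | grep -i -E 'requeue|preempt'
# Fields of interest:
#   JobRequeue   - cluster-wide default for whether batch jobs may be requeued
#   PreemptType  - plugin used to decide which jobs can preempt others
#   PreemptMode  - what happens to a preempted job (REQUEUE, CANCEL, SUSPEND, ...)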

In any case, the jobs keep restarting until they finish on their own (successfully or not), rather than being left interrupted.
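
If you want a hard cap on retries, one option is to have the job script give up after a certain number of restarts by checking SLURM_RESTART_COUNT, which Slurm sets to the number of times the job has been restarted. A minimal sketch, assuming an arbitrary cap of 5:

#!/bin/bash
#SBATCH --job-name=catsss
#SBATCH --array=1-10000
#SBATCH --requeue
#SBATCH --partition=scavenge

# SLURM_RESTART_COUNT is unset on the first run, hence the :-0 default.
MAX_RESTARTS=5   # arbitrary cap chosen for illustration
if [ "${SLURM_RESTART_COUNT:-0}" -ge "$MAX_RESTARTS" ]; then
    echo "Task ${SLURM_ARRAY_TASK_ID} already restarted ${SLURM_RESTART_COUNT} times; exiting." >&2
    exit 1
fi

module load Langs/Python/3.4.3
python3 cats.py ${SLURM_ARRAY_TASK_ID} 'cats'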

answered Nov 27 '25 by Poshi

