I have a pending job and I want to resize it. I tried:
scontrol update job <jobid> NumNodes=128
It does not work.
Note: I can change the walltime using scontrol, but when I try to change the number of nodes, it fails. According to this page, it looks like I should be able to change the node count: http://www.nersc.gov/users/computational-systems/cori/running-jobs/monitoring-jobs/.
Here is a solution I got from the NERSC help desk (credit to Woo-Sun Yang at LBNL):
$ scontrol update jobid=jobid numnodes=new_numnodes-new_numnodes
For example:
$ scontrol update jobid=12345 numnodes=10-10
The trick is to give numnodes as a min-max range with both values set to the new node count. It works for both shrinking and expanding the number of nodes of a job.
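The same format can be wrapped in a small shell function. This is only a sketch: the name resize_cmd is made up for illustration, and the function prints the command rather than running it, so it can be reviewed first.

```shell
# resize_cmd is a hypothetical helper, not part of Slurm.
# It prints the scontrol command that requests exactly N nodes (min = max = N).
resize_cmd() {
    # $1 = job ID, $2 = desired node count
    case "$2" in
        ''|*[!0-9]*)
            # Reject an empty or non-numeric node count
            echo "usage: resize_cmd <jobid> <nodes>" >&2
            return 1
            ;;
    esac
    printf 'scontrol update jobid=%s numnodes=%s-%s\n' "$1" "$2" "$2"
}

resize_cmd 12345 10   # scontrol update jobid=12345 numnodes=10-10
```

Once the printed line looks right, it can be pasted into the shell or piped to sh.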
You can resize jobs in Slurm provided that the job is pending or running.
According to the Slurm FAQ, you can resize a job by following these steps (with examples).
To expand a job, assume that j1 requests 4 nodes and is submitted with:
$ salloc -N4 bash
From the j1 shell, submit a new job (j2) that requests the extra nodes for j1 (10 in this case, for a total of 14 nodes) and make it dependent on j1 (SLURM_JOBID):
$ salloc -N10 --dependency=expand:$SLURM_JOBID
Deallocate the nodes of j2 (inside the j2 shell, SLURM_JOBID refers to j2):
$ scontrol update jobid=$SLURM_JOBID NumNodes=0
Terminate j2:
$ exit
Back in the j1 shell, assign the previously released nodes to j1:
$ scontrol update jobid=$SLURM_JOBID NumNodes=ALL
Then update the environment variables of j1 by sourcing the script that the previous command generated:
$ . ./slurm_job_${SLURM_JOBID}_resize.sh
Now, j1 has 14 nodes.
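One detail worth checking in the last step is how the shell expands the script name. An underscore is a valid character in a shell variable name, so without braces the shell looks up a variable called SLURM_JOBID_resize instead of SLURM_JOBID. A quick check, using a made-up job ID of 12345:

```shell
SLURM_JOBID=12345          # example value; Slurm sets this inside a real allocation
unset SLURM_JOBID_resize   # make sure the accidental variable is not set

# Without braces the variable name swallows "_resize" and expands to nothing.
unbraced="slurm_job_$SLURM_JOBID_resize.sh"
# With braces the name ends where intended.
braced="slurm_job_${SLURM_JOBID}_resize.sh"

echo "$unbraced"   # slurm_job_.sh
echo "$braced"     # slurm_job_12345_resize.sh
```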
To shrink a job, assume that j1 has been submitted with:
$ salloc -N4 bash
Update j1 to the new size:
$ scontrol update jobid=$SLURM_JOBID NumNodes=2
$ scontrol update jobid=$SLURM_JOBID NumNodes=ALL
Then update the environment variables of j1 by sourcing the script created by the previous commands:
$ . ./slurm_job_${SLURM_JOBID}_resize.sh
Now, j1 has 2 nodes.
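The resize script works by exporting variables, so it matters whether it is executed or sourced. The sketch below uses a stand-in script to show the difference. The temporary file and the value 2 are made up for illustration, and the real script generated by scontrol sets more variables than this one.

```shell
# Stand-in for the generated resize script; its contents are illustrative only.
demo_script="$(mktemp)"
echo 'export SLURM_NNODES=2' > "$demo_script"

SLURM_NNODES=4   # pretend this is the stale value from the original allocation

# Executed as a child process: the export does not reach the current shell.
sh "$demo_script"
after_exec="$SLURM_NNODES"

# Sourced with ".": the export takes effect in the current shell.
. "$demo_script"
after_source="$SLURM_NNODES"

echo "after executing: $after_exec"     # after executing: 4
echo "after sourcing: $after_source"    # after sourcing: 2
rm -f "$demo_script"
```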