I have a fair amount of hands-on experience with LSF, and this is actually interfering with my learning of SLURM.
I am particularly confused by srun and salloc, since I cannot map them to any of the LSF commands I am familiar with.
Most of all, I am puzzled by the fact that I have come across several "recipes" for starting an interactive session, some of which use srun while others use salloc.
For example, in the system I have access to, SLURM is configured so that either one of the following two commands would start an interactive shell session for me:
$ srun   --mem=$MEMORY --pty -- $SHELL
$ salloc --mem=$MEMORY
What exactly is the relationship between srun and salloc?  Is salloc a special case of srun?
The srun command was initially designed to start a parallel program on the resources allocated to the job; you can think of it as a Slurm-provided mpirun. The idea was that if you want a batch job, you create a submission script in which you use the srun command to start the program, and submit that script with sbatch. If you want an interactive job, you instead run salloc and then, once the resources are allocated, run srun within that allocation, interactively. The salloc command only requests resources and updates the environment with the SLURM variables. At that time, salloc would "leave you" on the login node and you had to run srun to obtain a shell on a compute node.
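A minimal sketch of those two classic workflows, assuming a hypothetical program ./my_mpi_program and arbitrary resource values (4 tasks, 4G of memory). The script below only writes the submission file job.sh (an arbitrary name); the Slurm commands themselves are shown as comments, since they only make sense on a cluster:

```shell
#!/bin/bash
# Write a minimal submission script. The file name, the resource values and
# ./my_mpi_program are all placeholders chosen for illustration.
cat > job.sh <<'EOF'
#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --mem=4G
# srun starts the program on the resources allocated to the job
srun ./my_mpi_program
EOF

# Batch workflow: submit the script; it runs later, without a terminal.
#   sbatch job.sh
#
# Interactive workflow (classic behaviour): allocate first, then launch.
#   salloc --ntasks=4 --mem=4G
#   srun ./my_mpi_program     # runs inside the allocation obtained by salloc
```

In both cases srun is the piece that actually launches the program; sbatch and salloc only differ in how the allocation is obtained.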
For debugging, it often happens that what users need is simply a Bash session on a single compute node. Or they need to run multiple short-lived parallel jobs. In both cases, running salloc and then srun can be seen as a waste of time. Therefore, very early on, srun was given the ability to request an allocation if it is not run within one. Since then, the recommendation for obtaining an interactive Bash session on a compute node was to run srun ... --pty -- $SHELL. But that only worked for single-node jobs, prevented running srun again within that interactive job (to start an MPI program for instance), and led to confusion in the minds of many users.
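To make the "if it is not run within one" part concrete, here is a small sketch that reports which of the two cases the current shell is in. It assumes that the presence of the SLURM_JOB_ID environment variable (set by Slurm inside an allocation) is a good enough indicator; the function name srun_mode is made up for this example:

```shell
#!/bin/bash
# Report how srun would behave if started from the current shell.
# Assumption: SLURM_JOB_ID being set means we are inside an allocation
# created by salloc or sbatch.
srun_mode() {
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        echo "step in job ${SLURM_JOB_ID}"
    else
        echo "new allocation"
    fi
}

srun_mode

# Outside of any allocation, the one-liner from the question requests
# resources and opens a shell on a compute node in a single command:
#   srun --mem=4G --pty -- "$SHELL"
```

On a login node this should print "new allocation"; inside a shell obtained through salloc or sbatch it should print the job ID instead.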
Since version 20.11, the InteractiveStepOptions parameter defines the behaviour of salloc when launched. By default, now, with LaunchParameters=use_interactive_step, it runs srun so that a shell is started on the first node of the allocation. This works for single-node jobs as well as multi-node jobs. The new recommendation for an interactive job with the default configuration is therefore to run salloc ..., which covers most use cases properly and simplifies the situation.
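If you are unsure which behaviour your cluster has, one way to check is to look at the running configuration. This is only a sketch: it assumes that scontrol show config prints the setting as a "LaunchParameters = ..." line, and the helper name uses_interactive_step is invented for this example:

```shell
#!/bin/bash
# Read `scontrol show config`-style text on stdin and succeed if the
# LaunchParameters line mentions use_interactive_step.
# Assumption: the setting is printed as "LaunchParameters = value".
uses_interactive_step() {
    grep -Eq '^[[:space:]]*LaunchParameters[[:space:]]*=.*use_interactive_step'
}

if command -v scontrol >/dev/null 2>&1; then
    if scontrol show config | uses_interactive_step; then
        echo "salloc should start the shell on the first allocated node"
    else
        echo "salloc likely leaves the shell where it was invoked; use srun to reach a compute node"
    fi
else
    echo "scontrol not found; this does not look like a Slurm cluster"
fi
```

When the first message is printed, a plain salloc --mem=$MEMORY as in the question should be enough to get an interactive shell on a compute node.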
Now the situation is a bit clearer:
- you request resources with sbatch (batch job); or
- salloc (interactive job);
- you launch programs on the allocated resources with srun;
- you copy files to the allocated nodes with sbcast.