I have a bash script, submit.sh, for submitting training jobs to a Slurm server. It works as follows: running
bash submit.sh p1 8 config_file
submits the task corresponding to config_file to 8 GPUs of partition p1. Each node of p1 has 4 GPUs, so this command requests 2 nodes.
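The node count Slurm ends up allocating here follows from ceiling-dividing the requested task count by the tasks per node. A quick sketch (variable names mirror those used in submit.sh below):

```shell
# Sketch: how many nodes an N-task request maps to, given 4 tasks per node.
NGPUs=8
NGPUS_PER_NODE=4
# Ceiling division: nodes needed to host all requested tasks.
NNODES=$(( (NGPUs + NGPUS_PER_NODE - 1) / NGPUS_PER_NODE ))
echo "${NNODES}"   # prints 2 for an 8-GPU request
```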
The content of submit.sh can be summarized as follows; I use sbatch to submit a Slurm script (train.slurm):
#!/bin/bash
# submit.sh
PARTITION=$1
NGPUs=$2
CONFIG=$3
NGPUS_PER_NODE=4
NCPUS_PER_TASK=10
sbatch --partition ${PARTITION} \
--job-name=${CONFIG} \
--output=logs/${CONFIG}_%j.log \
--ntasks=${NGPUs} \
--ntasks-per-node=${NGPUS_PER_NODE} \
--cpus-per-task=${NCPUS_PER_TASK} \
--gres=gpu:${NGPUS_PER_NODE} \
--hint=nomultithread \
--time=10:00:00 \
--export=CONFIG=${CONFIG},NGPUs=${NGPUs},NGPUS_PER_NODE=${NGPUS_PER_NODE} \
train.slurm
Now in the Slurm script, train.slurm, I decide whether to launch the training Python script on one node or on multiple nodes (the launch commands differ between these two cases):
#!/bin/bash
# train.slurm
#SBATCH --distribution=block:block
# Load Python environment
module purge
module load pytorch/py3/1.6.0
set -x
if [ ${NGPUs} -gt ${NGPUS_PER_NODE} ]; then # Multi-node training
# Some variables needed for the training script
export MASTER_PORT=12340
export WORLD_SIZE=${NGPUs}
# etc.
srun python train.py --cfg ${CONFIG}
else # Single-node training
python -u -m torch.distributed.launch --nproc_per_node=${NGPUS_PER_NODE} --use_env train.py --cfg ${CONFIG}
fi
Now if I submit to a single node (e.g., bash submit.sh p1 4 config_file), it works as expected. However, submitting to multiple nodes (e.g., bash submit.sh p1 8 config_file) produces the following error:
slurmstepd: error: execve(): python: No such file or directory
This means that the Python environment was not recognized on one of the nodes. I tried replacing python with $(which python) to use the full path to the Python binary in the virtual environment, but then I obtained another error:
OSError: libmpi_cxx.so.40: cannot open shared object file: No such file or directory
If I don't use submit.sh but instead add all the #SBATCH directives to train.slurm and submit the job with sbatch directly from the command line, then it works. It therefore seems that wrapping sbatch inside a bash script causes this issue.
Could you please help me resolve it? Thank you in advance.
Beware that the --export parameter resets the environment for srun to exactly the SLURM_* variables plus the ones explicitly listed, in your case CONFIG, NGPUs, and NGPUS_PER_NODE. Consequently, the PATH variable is not set and srun cannot find the python executable.
Note that --export does not alter the environment of the submission script itself, which is why the single-node case, which does not use srun, runs fine.
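This effect can be simulated outside Slurm with env -i, which, like a bare --export=VAR=..., starts from an empty environment and adds only the listed variables (myconfig is just a placeholder value):

```shell
# Start from an empty environment, keep only CONFIG, then list what remains.
env -i CONFIG=myconfig env
# Only CONFIG=myconfig is printed: PATH is gone, so a bare `python`
# cannot be resolved, matching the execve() error from slurmstepd.
```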
Try submitting with
--export=ALL,CONFIG=${CONFIG},NGPUs=${NGPUs},NGPUS_PER_NODE=${NGPUS_PER_NODE} \
Note the added ALL as the first item in the list.
Another option is to simply remove the --export line entirely and export the variables explicitly in the submit.sh script, since the submission environment is propagated to the job by default by Slurm:
export PARTITION=$1
export NGPUs=$2
export CONFIG=$3
export NGPUS_PER_NODE=4
export NCPUS_PER_TASK=10
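Putting that second option together, submit.sh would then look roughly like this (the same sbatch options as in the question, minus the --export line; a sketch, not tested on your cluster):

```shell
#!/bin/bash
# submit.sh -- variant without --export: exported variables are
# propagated to the job environment by default by Slurm.
export PARTITION=$1
export NGPUs=$2
export CONFIG=$3
export NGPUS_PER_NODE=4
export NCPUS_PER_TASK=10

sbatch --partition ${PARTITION} \
       --job-name=${CONFIG} \
       --output=logs/${CONFIG}_%j.log \
       --ntasks=${NGPUs} \
       --ntasks-per-node=${NGPUS_PER_NODE} \
       --cpus-per-task=${NCPUS_PER_TASK} \
       --gres=gpu:${NGPUS_PER_NODE} \
       --hint=nomultithread \
       --time=10:00:00 \
       train.slurm
```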