My cluster consists of 3 PCs (Raspbian with Slurm 18), all connected to shared file storage mounted as /storage.
The task file is /storage/multiple_hello.sh:
#!/bin/bash
#SBATCH --ntasks-per-node=1
#SBATCH --nodes=3
#SBATCH --ntasks=3
cd /storage
srun echo "Hello World from $(hostname)" >> ./"$SLURM_JOB_ID"_$(hostname).txt
It is run as sbatch /storage/multiple_hello.sh, and the expected outcome is that 3 files are created in /storage, named 120_node1.txt, 121_node2.txt and 122_node3.txt (arbitrary job numbers).
Actual output: only one file is created, 120_node1.txt.
How can I make it work as intended?
Weirdly enough, srun --nodes=3 hostname works as expected and returns:
node1
node2
node3
To get the expected result, change the last line to
srun bash -c 'echo "Hello World from $(hostname)" >> ./"$SLURM_JOB_ID"_$(hostname).txt'
Bash parses the original line differently from what you expect. First, $(hostname) and $SLURM_JOB_ID are expanded by the shell on the first node of the allocation (the one that runs the submission script). Then srun is run, and its combined output is appended to that single file by the same shell. You need to make the redirection >> part of the command that srun runs. With the solution above, the variable and command expansions, as well as the redirection, are performed on each node.
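The difference can be reproduced without a cluster. The following is only a minimal sketch under stated assumptions: fake_srun is a hypothetical stand-in for srun (not part of Slurm) that runs a command once per pretend node, and NODE_NAME is an illustrative variable that plays the role of $(hostname).

```shell
#!/bin/bash
# Hypothetical stand-in for srun (NOT a Slurm command): runs the given
# command once per pretend node, with NODE_NAME set for that run only.
fake_srun() {
    local node
    for node in node1 node2 node3; do
        NODE_NAME="$node" "$@"
    done
}

# Work in a scratch directory so nothing else is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# The calling shell plays the batch script running on the first node.
export NODE_NAME=node1

# Case 1: the calling shell expands $NODE_NAME and sets up the
# redirection before fake_srun starts, so all output lands in one file.
fake_srun echo "Hello World from $NODE_NAME" >> "./early_$NODE_NAME.txt"

# Case 2: the single-quoted string is passed through untouched; each
# child bash expands $NODE_NAME and opens its own output file.
fake_srun bash -c 'echo "Hello World from $NODE_NAME" >> "./late_$NODE_NAME.txt"'

# Count the files each case produced.
early=(early_*.txt)
late=(late_*.txt)
echo "early: ${#early[@]} file(s), late: ${#late[@]} file(s)"
```

In this sketch, case 1 should leave a single file holding three identical lines, while case 2 should leave one file per pretend node, which mirrors the behaviour described in the question and the fix.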