
Why am I able to run an R script in a Singularity container manually without problems, while as an array job on a Slurm HPC it fails at various levels? This contradicts the assumption that containers permit reproducibility with moderate effort.

First, I try to bind an external directory to a container directory that sits parallel to the directory containing the script of interest:

srun singularity exec --bind ./extdirectory:/home/user/intdirectory image.sif Rscript /home/user/intdirectory2/script.R

Run manually without srun, this works fine, but as an array job defined in the .sh file, the .out files say:

Fatal error: cannot open file '/home/user/intdirectory2/script.R': No such file or directory

I do understand that if I tried to bind onto the same directory that holds the script inside the container, the mount would hide the script.
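For context, the array job is submitted from a .sh file that, in simplified form, looks roughly like this (a sketch only; the job name, array range, and resource requests are placeholders, not my exact settings):

#!/bin/bash
#SBATCH --job-name=r_array
#SBATCH --array=1-10
#SBATCH --time=00:30:00
#SBATCH --mem=4G

# Bind the external data directory into the container and run the script
srun singularity exec --bind ./extdirectory:/home/user/intdirectory image.sif Rscript /home/user/intdirectory2/script.R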

Okay, that was the first problem. Then, if I don't bind at all, the array jobs fail as follows:

Fatal error: cannot open file '/home/user/intdirectory2/script.R': Permission denied

I originally built the image as an OCI image using podman build on Windows. I then used podman save to export a .tar file and, on the HPC, converted it into the .sif image with singularity build image.sif docker-archive://image.tar. In the Containerfile used by podman build, I downgraded the user privileges inside the container with the following lines:

RUN useradd user
RUN chown -R user /home/user
RUN chmod -R 700 /home/user
USER user
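The export and conversion steps were essentially the following (the image name and tag are placeholders for whatever podman build produced):

# On Windows: export the OCI image built with podman into a tar archive
podman save -o image.tar localhost/image:latest

# On the HPC: convert the archive into a SIF image
singularity build image.sif docker-archive://image.tar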

However, when I call whoami, either in a manually started Singularity session or in an array job, in both cases I actually see my personal user account of the underlying HPC.

I have also tried to execute the script of interest as the default operation of the container, by setting CMD Rscript /home/user/intdirectory2/script.R in the Containerfile and calling singularity run image.sif, without luck. I also tried distributing the image file onto the compute nodes with sbcast image.sif /tmp/image.sif in the .sh file and starting the container with srun singularity exec /tmp/image.sif Rscript /home/user/intdirectory2/script.R, again without luck.
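The sbcast variant of the job script was roughly as follows (again a sketch; the array range is a placeholder):

#!/bin/bash
#SBATCH --array=1-10

# Copy the image to node-local storage on every allocated node
sbcast image.sif /tmp/image.sif

# Run the script from the node-local copy of the image
srun singularity exec /tmp/image.sif Rscript /home/user/intdirectory2/script.R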

Here are some versions:

apptainer version 1.1.9-1.el7
slurm 21.08.8-2

I ended up in this situation after trying to launch the parallel containers using the system() command in R sessions on the compute nodes, but eventually gave up, as discussed here. I am pretty confused. In both the manual and array-job cases I can see that the containers start, judging by the INFO: /etc/singularity/ exists... messages, and calling /bin/echo "hello world" from the containers also works. But for some reason the underlying system affects both the visibility (when binding elsewhere) and the permissions of the R script handled inside the container.

Further research:

According to this tutorial, singularity exec should be called without srun in the .sh file. I think I had already tried this, but I verified it again and still cannot access the internal script when the bind is active.
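In other words, the line in the .sh file becomes simply:

# Called directly in the batch script, without srun
singularity exec --bind ./extdirectory:/home/user/intdirectory image.sif Rscript /home/user/intdirectory2/script.R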

Imsa

1 Answer


Finally, I was able to bind the external directory in the container, find the R script of interest, and have permission to use it during an array job. In the Containerfile I changed the permission line to RUN chmod -R 755 /home/user, giving wider permissions on the contents of the home directory. But the case is still strange and goes against the philosophy of reproducible containers: exactly the same singularity exec --bind ./extdirectory:/home/user/intdirectory image.sif Rscript /home/user/intdirectory2/script.R behaved differently between the login and compute nodes, although the user inside the containers was at least seemingly the same.
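So the user-related part of the Containerfile now reads as follows (only the chmod line changed from the version in the question):

RUN useradd user
RUN chown -R user /home/user
RUN chmod -R 755 /home/user
USER user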

Imsa
  • I can immediately say this is a problem with your build process, not Singularity. I am not entirely sure why you downgraded the user privileges in the container. – Prakhar Sharma Aug 25 '23 at 10:34
  • @PrakharSharma The build process definitely caused the bottleneck (partly). It was difficult to figure out because of the illogical symptoms and behaviour described above. The downgrading was done as a simple security improvement, just in case the work happens to attract interest. It was easy to do and did not cause any trouble (misleadingly so, before the parallelized phases of development). – Imsa Aug 25 '23 at 12:00