Sorry if the question is too open-ended or otherwise not suitable, but this is due to my lack of understanding of several pieces of technology/software, and I'm quite lost. I have a project with an existing Java Swing GUI which runs MPI jobs on a local machine, but it is desired to also support running MPI jobs on HPC clusters (let's assume a Linux cluster with SSH access). To be more specific, the main backend executable (Linux and Windows builds) that I need to run uses a very simple master-slave scheme in which all relevant output is produced by the master process only. Currently, to run the backend on multiple machines, I simply copy all the necessary files to each machine (assuming no shared filesystem) and call "mpiexec" or "mpirun" in the usual way. The output produced by the master then needs to be read in (or partially read in) by my GUI.
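For concreteness, here's roughly how the local case works — just a sketch, not my actual code; the helper name and the idea of forwarding each stdout line to a callback (which the Swing side would consume) are assumptions:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.List;
import java.util.function.Consumer;

public class LocalMpiLauncher {

    /**
     * Launches a command, forwards each stdout line to onLine, and returns
     * the process exit code. stderr is merged into stdout so the GUI sees
     * both streams in order.
     */
    public static int runAndStream(List<String> command, Consumer<String> onLine)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(command)
                .redirectErrorStream(true)   // merge stderr into stdout
                .start();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                onLine.accept(line);         // e.g. publish to the Swing EDT
            }
        }
        return p.waitFor();
    }
}

// Hypothetical invocation; "backend" is a placeholder executable name:
// LocalMpiLauncher.runAndStream(List.of("mpiexec", "-n", "4", "./backend"),
//                               textArea::append);
```

In the actual GUI the `onLine` callback would typically hand each line to a `SwingWorker`'s publish/process mechanism rather than touching Swing components directly.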
The main problem as I see it is this: where should the GUI run? Several options:
- Local machine - the potential problem is needing to read data back from the cluster to the local machine (and also the stdout/stderr of the cluster processes) in order to display current progress to the user.
- Login node - the obvious problem of hogging precious resources, which in many cases will be banned.
- Compute node - sounds pretty dodgy, especially if the cluster has a queuing system (Slurm, Sun Grid Engine, etc.)! Also possibly banned.
Of these three options, the first seems the most reasonable, and also the least likely to upset any HPC admins, but it is also the hardest to implement! There are several problems with that setup:
- Passing data from the cluster to the local machine - because we're using a cluster, by definition we will probably generate large amounts of data, and the user wants to see at least part of it! Also, how should this be done? I can see how to execute commands on a remote machine over SSH using JSch or similar, but once a command is running on the remote machine, how do I get information back to the local machine?
- Displaying the backend's stdout/stderr on the local machine - similar to the above.
- Dealing with the peculiarities of individual clusters - the only way I can see around this is to let the user write custom Slurm scripts or the like.
- Detecting whether the backend computations have finished or failed - this problem interacts with any custom Slurm scripts written by the user.
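To make the first, second and last bullets concrete, here's roughly what I have in mind: run everything through plain ssh/scp and have the remote shell print a sentinel line carrying the backend's exit code, which the GUI parses from the streamed output. This is only a sketch under my own assumptions (the helper names, the sentinel convention, and the user/host/paths are all made up); a library like JSch would replace the ssh/scp processes with a `Session`/`ChannelExec`:

```java
import java.util.List;
import java.util.OptionalInt;

public class RemoteLaunchPlan {

    /** Last line printed by the remote shell, carrying the backend's exit code. */
    public static final String SENTINEL = "__BACKEND_EXIT__";

    /** Builds an scp command that copies one local file to the remote machine. */
    public static List<String> scpCommand(String user, String host,
                                          String localPath, String remotePath) {
        return List.of("scp", localPath, user + "@" + host + ":" + remotePath);
    }

    /**
     * Builds an ssh command that runs the backend remotely and appends a
     * sentinel line, so the GUI can tell success from failure by watching
     * the streamed stdout.
     */
    public static List<String> sshRunCommand(String user, String host,
                                             String remoteCommand) {
        return List.of("ssh", user + "@" + host,
                remoteCommand + "; echo " + SENTINEL + " $?");
    }

    /** Returns the exit code if this stdout line is the sentinel, else empty. */
    public static OptionalInt parseSentinel(String line) {
        if (!line.startsWith(SENTINEL + " ")) {
            return OptionalInt.empty();
        }
        try {
            return OptionalInt.of(
                    Integer.parseInt(line.substring(SENTINEL.length() + 1).trim()));
        } catch (NumberFormatException e) {
            return OptionalInt.empty();
        }
    }
}

// Hypothetical usage; paths and host are placeholders:
// scpCommand("alice", "cluster", "backend", "/home/alice/backend")
// sshRunCommand("alice", "cluster", "cd /home/alice && mpirun -n 4 ./backend")
```

The same sentinel idea would also survive a user-supplied Slurm script, as long as the script's last action is to echo the sentinel with the job's exit status.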
Hopefully it should be clear from the above that I'm quite confused. I've had a look at Apache Camel, JSch, Ganymed SSH-2, Apache MINA, Netty, Slurm, Sun Grid Engine, Open MPI, MPICH and PMI, but there's so much information that I think I need to ask for some help and advice. I would greatly appreciate any comments regarding these problems!
Thanks
================================
Edit
Actually, I just came across this: link, which seems to suggest that if the cluster allows "interactive"-mode jobs, then you can run a GUI from a compute node. However, I don't know much about this, nor do I know how common it is. I would be grateful for comments on this aspect.
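From what I can tell, under Slurm an interactive job is usually requested with something like the following — a sketch only, since the exact flags vary between sites, and `--x11` (needed to forward a GUI's display) only works if the site's Slurm build has X11 support enabled:

```shell
# Request an interactive job (1 node, 4 tasks) with X11 forwarding and
# start a shell on the compute node; flags and limits are site-dependent.
srun --nodes=1 --ntasks=4 --pty --x11 bash -i

# Inside that shell, the backend could then be launched as usual, e.g.:
# mpirun -n 4 ./backend
```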