MOCCAMED cluster overview
Here we describe our cluster, called CLUSTERINO, in use at the MOCCAMED group of the Medical University of Vienna.
Our resources did not allow us to build a dedicated machine farm for cluster computing, so we decided to use the Condor batch processing software, which lets us harvest CPU time from machines that are otherwise in use. In essence, Condor is installed on a PC, starts calculations whenever that PC has been idle for a configurable time, and stops them as soon as a user needs the PC's power. Condor can also start and end jobs at specified times, e.g. to run calculations only over night or during weekends. Condor was designed for large high-performance computing sites, so chances are you won't need most of its options; however, the flexible load balancing and the job and user priority management are quite useful.
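As an illustration, the idle-time behaviour is controlled by policy expressions in the Condor configuration. A simplified sketch follows; the threshold values are examples, not our production settings:

```
# Simplified condor_config policy sketch (threshold values are examples).
# Start a job only after 15 minutes without keyboard activity ...
START    = KeyboardIdle > (15 * 60)
# ... suspend it as soon as the user comes back ...
SUSPEND  = KeyboardIdle < 60
# ... and resume once the machine has been idle again for 5 minutes.
CONTINUE = KeyboardIdle > (5 * 60)
```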
Here is a short description of our environment and its restrictions. We will refer to a single machine in the cluster as a node.
- Only one machine is dedicated to cluster calculations
- Some machines, where CPU power is not needed, run calculations in the background around the clock
- Multiple other machines with different 64-bit Linux operating systems for part-time calculations
- Machines distributed in different network segments
- 1 Gbit/s network connections available
- No global user management: the users existing on one workstation may or may not exist on other workstations or in the cluster, and in particular the user ids differ. Therefore we cannot rely on every cluster user having an account on all cluster machines
- Network connections may break and reconnect at any time
- Machines may be switched off for longer periods or reboot repeatedly without notice
It turned out that Condor and GATE cooperate well even when installed on different Linux machines, as long as all machines employ 64-bit distributions.
Condor has an integrated file transfer mechanism which can transfer all files necessary for a given calculation to the executing machine and transfer the results back afterwards. In theory it might even be possible to let Condor distribute the whole GATE installation to the node, transfer back the results, and delete GATE after the calculation has finished. We would be very interested to hear from anyone who succeeds with such a configuration, as we found it exceedingly difficult.
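For reference, Condor's file transfer mechanism is enabled by a few commands in the submit description file; the input file names below are placeholders, not part of our setup:

```
# Let Condor copy the job's files to the execute node and back.
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
# Files to ship along with the job (placeholder names).
transfer_input_files    = macro.mac, materials.db
```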
Simulation data and simulation results are shared using a network file system (NFS). The respective folders are mounted at start-up, guaranteeing that all nodes have access to the data necessary for the simulations. This raises a permission problem, however. Without central user management (e.g. NIS), different users exist on different machines, and even where the same user names are present there is a good chance that the user ids (uids) differ. These ids are the unique identifiers of users and are used by Linux to control file access. In practice this means that if you grant user A (e.g. uid 17) read/write access on the NFS, the same user A on another node might not be able to access it, simply because that system assigned him e.g. uid 5. Furthermore, as mentioned above, we cannot guarantee that all users are present on all systems.
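Mounting the shared folders at start-up can be done with an entry in /etc/fstab on each node; the server name and paths here are hypothetical:

```
# /etc/fstab: mount the simulation share from the file server at boot
# (server name and paths are examples)
fileserver:/export/simdata   /mnt/simdata   nfs   defaults,_netdev   0 0
```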
Therefore we decided to use a user which is present on all Linux systems by default. On typical Linux distributions, this would be root or nobody. As we did not want to use root for security reasons, we chose the user nobody. It is present on all Linux distributions but may have different uids depending on the distribution. We then created a special group, also called nobody, of which the user nobody is a member, and assigned this group a fixed group id which is the same on all cluster machines. Giving read/write permissions on the NFS to the group nobody, the user nobody on every machine has read/write access.
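A sketch of this setup in shell commands, assuming a hypothetical group id 2000 and a hypothetical export path (both must be adapted to your site):

```shell
# Run as root on every cluster node. The group id 2000 is only an
# example -- it must be unused and identical on all machines.
groupadd -g 2000 nobody || true   # a 'nobody'/'nogroup' group may already exist
usermod -a -G nobody nobody       # make the user nobody a member of the group

# On the NFS export (path is an example): grant the group read/write access.
chgrp -R nobody /srv/simdata
chmod -R g+rwX /srv/simdata
```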
We now have to ensure that all simulations are started as user nobody. By default, Condor starts all simulations with the rights of the user who submitted the calculation, which, as explained above, won't work in our case. Employing the Condor setting
in the submission file, all jobs are run as the user nobody and can thus access the NFS.
Another problem arises for GATE simulations: GATE uses environment variables to locate all necessary software packages such as ROOT or Geant4. Condor can forward all environment variables defined in the submitting shell to the job by adding
getenv = True
in the submission file.
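Putting it together, a minimal submit description file for a GATE job could look like the sketch below; the executable path, macro name, and log file names are assumptions for illustration only:

```
# Example submit description file; paths and file names are placeholders.
universe    = vanilla
# GATE binary, installed identically on every node
executable  = /usr/local/bin/Gate
arguments   = simulation.mac
# forward the submitter's environment (ROOT, Geant4 variables)
getenv      = True
log         = gate.log
output      = gate.out
error       = gate.err
queue
```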
As most of our calculations are Monte Carlo simulations employing GATE, we decided to install GATE on all machines joining the cluster. This reduces the file transfer over the network considerably. The only thing to ensure is that the GATE paths are the same on all cluster nodes. We installed GATE and all corresponding software packages into
as this folder is usually unused.
More detailed configuration information is available on request. Please feel free to provide feedback.