The Sun Grid Engine (SGE) is a queue and scheduler that accepts jobs and runs them on the cluster for the user. Three types of jobs are available: interactive, batch, and parallel.
This guide assumes you are logged into the cluster and know how to create and edit files. Never assume a program is on the path; always call programs by their full path name (e.g. /usr/local/bin/xxxx). This also helps when a script doesn't work and needs to be debugged.
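One way to follow this advice in a job script is to check that a program's full path exists and is executable before using it. The sketch below is illustrative only; the helper name require_program is hypothetical and not part of SGE.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if the given full path is an executable file.
require_program() {
    if [ -x "$1" ] && [ ! -d "$1" ]; then
        return 0
    fi
    echo "Program not found or not executable: $1" >&2
    return 1
}

# Example: /bin/sh exists on virtually every Unix system.
if require_program /bin/sh; then
    echo "/bin/sh is available"
fi
```

In a real job script the argument would be the full path of the program the job runs, and the script would stop early if the check fails.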
Running Batch(Serial) Jobs with SGE:
A batch (or serial) job is one that runs on a single node. This is in contrast to a parallel job, where a single job runs on many nodes in an interconnected fashion, generally using MPI to communicate between individual processes. If you are running the same program on the cluster as you would on your desktop, chances are you will want to use a serial job.
One thing to keep in mind when creating jobs is your directory structure. It's a good idea to organize the files needed for a job into a single folder. If read-only files are needed by multiple jobs, use symlinks so that no duplicate files take up extra space. An example of a good directory structure could be:
Project1/
Project1/jobA
Project1/jobB
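The layout above, with a read-only file shared through symlinks, could be created as in the following sketch. The shared folder and the file name reference.dat are hypothetical and used only for illustration; the example works in a scratch directory so it does not touch real project files.

```shell
#!/bin/bash
# Build the example layout in a scratch directory so nothing real is touched.
root=$(mktemp -d)
mkdir -p "$root/Project1/shared" "$root/Project1/jobA" "$root/Project1/jobB"

# Hypothetical read-only input file needed by both jobs.
echo "reference data" > "$root/Project1/shared/reference.dat"
chmod a-w "$root/Project1/shared/reference.dat"

# Each job folder gets a relative symlink instead of its own copy.
for job in jobA jobB; do
    ln -s ../shared/reference.dat "$root/Project1/$job/reference.dat"
done

ls -l "$root/Project1/jobA"
```

Relative links keep working if the whole Project1 folder is later moved or renamed.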
In this example, we will run a MATLAB script. The job script, saved as matlab-test.job, looks like this:
#!/bin/bash
# The name of the job; it can be anything and is only used when displaying the list of running jobs
#$ -N matlab-test
# Giving the name of the output log file
#$ -o matlabTest.log
# Combining output/error messages into one file
#$ -j y
# Tell the queue system to use the current directory as the working directory,
# or else the script may fail because it will execute in your top-level home directory, /home/username
#$ -cwd
# Now come the commands to be executed
/share/apps/matlab/bin/matlab -nodisplay -nodesktop -nojvm -r matlab-test
# Note: the argument after -r is the name of the routine, not the name of the m-file
exit 0
Submit the job with qsub:
qsub matlab-test.job
When the job has completed, you can check its output in the file named above, matlabTest.log.
NOTE: You may see the following in the output:
“Warning: no access to tty (Bad File descriptor).
Thus no job control in this shell.”
This is normal and can be ignored. With MATLAB you may also see a message about shopt; this is also normal and can be ignored.
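If these messages clutter the log, they can be filtered out when viewing it. The following is a minimal sketch: the helper name show_log is hypothetical, and the sample log content is made up so the example can run anywhere.

```shell
#!/bin/bash
# Hypothetical helper: print a job log without the two harmless tty warnings.
show_log() {
    grep -v -e 'no access to tty' -e 'no job control in this shell' "$1"
}

# Made-up sample log, written to a scratch file for demonstration.
log=$(mktemp)
printf '%s\n' \
    'Warning: no access to tty (Bad file descriptor).' \
    'Thus no job control in this shell.' \
    'ans = 42' > "$log"

show_log "$log"
```

On the cluster, the same helper would be called as show_log matlabTest.log.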
Attached are the sample job and MATLAB script.
Running Interactive Jobs with SGE:
An interactive job is one in which you run a program interactively on a node. This is useful for building and testing scripts. It is not the place to run long-running or computationally intensive jobs, or any other work better suited to a batch job. An example would be the development of a MATLAB script: you can launch an interactive job, develop the script, and write the job file. But when it comes to running the job itself, it needs to be submitted as a batch job.
To run an interactive job, simply type qlogin.
Running Parallel Jobs with SGE:
A parallel job is one where a single job runs on many nodes in an interconnected fashion, generally using MPI to communicate between individual processes. If you are running the same program on the cluster as you would on your desktop, chances are you will want to use a serial job, not a parallel job. Parallel jobs are generally only for specially designed programs that only work on machines with cluster management software installed.
Also, not just any program can run in parallel; it must be written as a parallel program and compiled against a particular MPI library. In this case we build a simple program that passes a message between processes and compile it against OpenMPI, the main MPI library of the cluster.
Also note that the scheduler will only accept parallel jobs requesting between 4 and 8 slots. It is currently set up to start parallel processes on a single node to limit the overhead of inter-process communication over the network, which adds considerable run time to the job. For most jobs, more slots is not always better.
#!/bin/bash
#$ -N openmpi-test
# Here we tell the queue that we want the orte parallel environment and request 4 to 8 slots
# This option takes the following form: -pe nameOfEnv min-Max
# where you request a minimum and maximum number of slots
#$ -pe orte 4-8
# For parallel jobs, it's a good idea to use even numbers.
#$ -cwd
#$ -j y
/opt/openmpi/bin/mpirun -n $NSLOTS mpi-ring
exit 0
NOTES: There are a few queue commands to know
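A few standard SGE commands cover most day-to-day needs: qstat lists jobs, qdel removes a job, and qhost shows the execution hosts. The sketch below wraps qstat in a small helper so that it fails gracefully on a machine without SGE. The name my_jobs is hypothetical and not an SGE command, and JOB_ID in the comments stands for a real job number.

```shell
#!/bin/bash
# Hypothetical wrapper: list your own jobs, or explain why that is not possible.
my_jobs() {
    if ! command -v qstat >/dev/null 2>&1; then
        echo "qstat not found; run this on a host with SGE installed" >&2
        return 127
    fi
    qstat -u "${USER:-$(id -un)}"
}

my_jobs || true

# Other standard SGE commands, shown as comments because they need a real job ID:
#   qstat -j JOB_ID   # detailed information about one job
#   qdel JOB_ID       # remove a job from the queue
#   qhost             # list the execution hosts and their load
```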
SGE Environment Options And Environment Variables:
When a Sun Grid Engine job is run, a number of variables are preset into the job’s script environment, as listed below.
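The sketch below prints several variables that SGE commonly sets for a job: JOB_ID, JOB_NAME, NSLOTS, QUEUE, SGE_TASK_ID, SGE_O_WORKDIR, and TMPDIR. It is not a complete list. Outside the scheduler most of them are unset, so the hypothetical helper show_var prints a placeholder instead.

```shell
#!/bin/bash
# Hypothetical helper: print NAME=value, or NAME=(not set) if the variable is unset or empty.
show_var() {
    eval "value=\${$1:-}"
    echo "$1=${value:-(not set)}"
}

# Variable names are fixed here, so the eval above never sees untrusted input.
for name in JOB_ID JOB_NAME NSLOTS QUEUE SGE_TASK_ID SGE_O_WORKDIR TMPDIR; do
    show_var "$name"
done
```

Placing the loop near the top of a job script records in the job's log which slots, queue, and working directory the scheduler actually assigned.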
Advanced Jobs
Using other MPI Environments:
Besides OpenMPI, the default MPI environment, mpich2 is installed on the system at /opt/mpich2/gnu. To set up your environment to use mpich2 instead of OpenMPI, you'll have to alter your shell environment. To do so, use your text editor to edit /home/username/.bash_profile and add the following:
export PATH=/opt/mpich2/gnu/bin:$PATH
export LD_LIBRARY_PATH=/opt/mpich2/gnu/lib:$LD_LIBRARY_PATH
export LD_RUN_PATH=/opt/mpich2/gnu/lib:$LD_RUN_PATH
This adds mpich2 to the path and to the library path. When compiling programs, be sure to tell the configure script where mpicc/mpif90/etc are located by using the full path.
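After editing .bash_profile and logging in again, it is worth confirming that the mpich2 directory is searched first. A minimal sketch follows; the helper name first_path_entry is hypothetical, and the export line simply repeats the setting above so the example runs on its own.

```shell
#!/bin/bash
# Same setting as in .bash_profile, repeated so the check is self-contained.
export PATH=/opt/mpich2/gnu/bin:$PATH

# Hypothetical helper: print the first directory in PATH.
first_path_entry() {
    echo "${PATH%%:*}"
}

first_path_entry

# On the cluster, this shows which mpicc the shell will actually run.
command -v mpicc || echo "mpicc not found on this machine"
```

If the first entry is not /opt/mpich2/gnu/bin, a later line in the profile is probably prepending another directory.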
Launching an mpich2 job:
The job script is similar, but includes a few extra directives needed for mpich2:
#!/bin/bash
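#$ -N jobName
# Run the job from the current working directory
#$ -cwd
# Use bash as the shell for the job
#$ -S /bin/bash
# Request the mpich2 parallel environment with a min-Max range of slots
#$ -pe mpich2 min-Max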
export MPICH2_ROOT=/opt/mpich2/gnu
export PATH=$MPICH2_ROOT/bin:$PATH
export MPD_CON_EXT="sge_$JOB_ID.$SGE_TASK_ID"
/opt/mpich2/gnu/bin/mpiexec -machinefile $TMPDIR/machines -n $NSLOTS /path/to/program
exit 0
The Job Array:
Under Construction
| Attachment | Size |
|---|---|
| matlab-test.tar | 10 KB |
| openmpi-test.tar | 10 KB |