Sun Grid Engine Tutorial

Running Serial Jobs with SGE

A serial job runs on a single node. This is in contrast to a parallel job, in which a single job runs on many nodes in an interconnected fashion, generally using MPI to communicate between individual processes. If you are running the same program on the cluster as you would on your desktop, chances are you will want to use a serial job.

  1. Log into cluster.cbi.utsa.edu with "ssh username@cluster.cbi.utsa.edu"
  2. Navigate to the directory where the code you wish to run lives.
  3. Make a job file that contains all the flags and information needed to submit your script. This is optional, but useful: if something breaks, you can give this file to the admin and they can help you debug it. I name these files program.job to keep their purpose clear, but you can call them whatever you like.

    #---------------------------Start program.job------------------------
    #!/bin/bash

    # The name of the job, can be whatever makes sense to you
    #$ -N ProgramName_EstimatedHoursToFinish_MiscComment

    # The job should be placed into the queue 'all.q'.
    #$ -q all.q

    # Redirect output stream to this file.
    #$ -o sge_output.dat

    # Redirect error stream to this file.
    #$ -e sge_error.dat

    # The batch system should use the current directory as the working directory.
    # Both files (sge_output.dat and sge_error.dat) will be placed in the current
    # directory. The batch system expects to find the executable in this directory.
    #$ -cwd

    # This is my email address for notifications. I want to have all notifications
    # at the master node of this cluster.
    #$ -M username@domain.name

    # Send me an email when the job is finished.
    #$ -m e

    # These are the commands to be executed.
    echo $PATH

    time genesis ./CLIMB9.g > ./sge.out
    #---------------------------End program.job------------------------

  4. Submit your job to the queue using "qsub program.job"
  5. Check your job status using "qstat"
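
The steps above can be sketched as a single shell session. This is only an illustration under stated assumptions: the job name MyProgram_2h_test and the executable ./my_program are placeholders, not part of the tutorial, and the submission is skipped when qsub is not installed, so the sketch can be tried away from the cluster.

```shell
#!/bin/bash
# Sketch: write a minimal serial job file, check it, and submit it if SGE
# is available. "./my_program" and the job name are placeholder values.
jobfile=program.job

cat > "$jobfile" <<'EOF'
#!/bin/bash
#$ -N MyProgram_2h_test
#$ -q all.q
#$ -cwd
#$ -o sge_output.dat
#$ -e sge_error.dat
./my_program
EOF

# Count the embedded "#$ " directives so they can be checked before submitting.
directives=$(grep -c '^#\$ ' "$jobfile")
echo "$jobfile contains $directives SGE directives"

if command -v qsub >/dev/null 2>&1; then
    qsub "$jobfile"      # step 4: submit the job to the queue
    qstat                # step 5: check the job status
else
    echo "qsub not found; run this on the cluster login node"
fi
```

Writing the job file with a quoted here-document keeps the "#$" lines literal, so the shell does not try to expand anything inside them.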

Running Parallel Jobs with SGE

A parallel job is one in which a single job runs on many nodes in an interconnected fashion, generally using MPI to communicate between individual processes. If you are running the same program on the cluster as you would on your desktop, chances are you will want to use a serial job, not a parallel job. Parallel jobs are generally only for specially designed programs, which work only on machines with cluster management software installed.

  1. Log into cluster.cbi.utsa.edu with "ssh username@cluster.cbi.utsa.edu"
  2. Navigate to the directory where the code you wish to run lives.
  3. Make a job file that contains all the flags and information needed to submit your script. This is optional, but useful: if something breaks, you can give this file to the admin and they can help you debug it. I name these files programname.job to keep their purpose clear, but you can call them whatever you like.

    #---------------------------Start programname.job------------------------
    #!/bin/bash

    # The job should be placed into the queue 'all.q'.
    #$ -q all.q

    # The name of the job, can be whatever makes sense to you
    #$ -N ProgramName_EstimatedHoursToFinish_MiscComment

    # Redirect output stream to this file.
    #$ -o sge_output.dat

    # Redirect error stream to this file.
    #$ -e sge_error.dat

    # The batch system should use the current directory as the working directory.
    # Both files (sge_output.dat and sge_error.dat) will be placed in the current
    # directory. The batch system expects to find the executable in this directory.
    #$ -cwd

    # This is my email address for notifications. I want to have all notifications
    # at the master node of this cluster. This is optional.
    #$ -M yourusername@youremaildomain.com

    # Send me an email when the job is finished.
    #$ -m e

    # Use the parallel environment "lam", which assigns two processes
    # to one host. In this example, if there are not enough machines to run the
    # MPI job on 120 processors, the batch system can use fewer than 120, but
    # the job will not run on fewer than 30 processors.
    #$ -pe lam 30-120

    # These are the commands to be executed.
    echo $PATH

    mpirun -np $NSLOTS ./eqn10p.x
    #---------------------------End programname.job------------------------

  4. Submit your job to the queue using "qsub programname.job"
  5. Check your job status using "qstat"
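
Because the job file requests a slot range (30-120), the number of processes is only known once the job starts. The sketch below shows one way to read that count from NSLOTS. The helper name granted_slots and the fallback value of 1 are assumptions made for illustration, so that the script can also be dry-run outside the batch system.

```shell
#!/bin/bash
# Sketch: choose the MPI process count from the slots SGE actually granted.
# NSLOTS is set by SGE inside a parallel job. The fallback of 1 and the
# helper name granted_slots are assumptions for running outside SGE.
granted_slots() {
    echo "${NSLOTS:-1}"
}

np=$(granted_slots)

# With "#$ -pe lam 30-120" the granted count can be anywhere from 30 to 120,
# so the process count should never be hard-coded.
echo "starting $np MPI processes"

if command -v mpirun >/dev/null 2>&1 && [ -x ./eqn10p.x ]; then
    mpirun -np "$np" ./eqn10p.x
else
    echo "mpirun or ./eqn10p.x not available; skipping the actual run"
fi
```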

SGE Environment Options

The following sample script file sets its Sun Grid Engine options with embedded #$ lines.

    #!/bin/csh
    # Force csh if it is not the Sun Grid Engine default shell.
    #$ -S /bin/csh
    # This is a sample script file for compiling and
    # running a sample FORTRAN program under Sun Grid Engine.
    # We want Sun Grid Engine to send mail when the job begins
    # and when it ends.
    #$ -M EmailAddress
    #$ -m b,e
    # We want to name the file for the standard output
    # and merge the standard error into it (-j y).
    #$ -o flow.out -j y
    # Change to the directory where the files are located.
    cd TEST
    # Now we need to compile the program 'flow.f' and
    # name the executable 'flow'.
    f77 flow.f -o flow
    # Once it is compiled, we can run the program.
    ./flow

Environment Variables

When a Sun Grid Engine job is run, a number of variables are preset in the job’s
script environment, as listed below.

  • ARC - The Sun Grid Engine architecture name of the node on which the job is
    running; the name is compiled into the sge_execd binary

  • COMMD_PORT - Specifies the TCP port on which sge_commd(8) is expected to
    listen for communication requests

  • SGE_ROOT - The Sun Grid Engine root directory as set for sge_execd before
    start-up or the default /usr/SGE

  • SGE_CELL - The Sun Grid Engine cell in which the job executes

  • SGE_JOB_SPOOL_DIR - The directory used by sge_shepherd(8) to store job-related
    data during job execution

  • SGE_O_HOME - The home directory path of the job owner on the host from which
    the job was submitted

  • SGE_O_HOST - The host from which the job was submitted

  • SGE_O_LOGNAME - The login name of the job owner on the host from which the
    job was submitted

  • SGE_O_MAIL - The content of the MAIL environment variable in the context of the
    job submission command

  • SGE_O_PATH - The content of the PATH environment variable in the context of the
    job submission command

  • SGE_O_SHELL - The content of the SHELL environment variable in the context of
    the job submission command

  • SGE_O_TZ - The content of the TZ environment variable in the context of the job
    submission command

  • SGE_O_WORKDIR - The working directory of the job submission command

  • SGE_CKPT_ENV - Specifies the checkpointing environment (as selected with the
    qsub -ckpt option) under which a checkpointing job executes

  • SGE_CKPT_DIR - Only set for checkpointing jobs; contains path ckpt_dir (see
    the checkpoint manual page) of the checkpoint interface

  • SGE_STDERR_PATH - The path name of the file to which the standard error
    stream of the job is diverted; commonly used for enhancing the output with error
    messages from prolog, epilog, parallel environment start/stop or checkpointing
    scripts

  • SGE_STDOUT_PATH - The path name of the file to which the standard output
    stream of the job is diverted; commonly used for enhancing the output with
    messages from prolog, epilog, parallel environment start/stop or checkpointing
    scripts

  • SGE_TASK_ID - The task identifier in the array job represented by this task

  • ENVIRONMENT - Always set to BATCH; this variable indicates that the script is run
    in batch mode

  • HOME - The user’s home directory path from the passwd file

  • HOSTNAME - The host name of the node on which the job is running

  • JOB_ID - A unique identifier assigned by the sge_qmaster when the job was
    submitted; the job ID is a decimal integer in the range 1 to 99999

  • JOB_NAME - The job name, built from the qsub script filename, a period, and the
    digits of the job ID; this default may be overridden with qsub -N

  • LOGNAME - The user’s login name from the passwd file

  • NHOSTS - The number of hosts in use by a parallel job

  • NQUEUES - The number of queues allocated for the job (always 1 for serial jobs)

  • NSLOTS - The number of queue slots in use by a parallel job
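
A job script can print a few of the variables listed above at start-up, which can help when debugging. The following is a minimal sketch for bash. The function name report_sge_env and the "<not set>" placeholder are our own choices for illustration, and the selection of variables is arbitrary.

```shell
#!/bin/bash
# Sketch: print selected SGE variables from inside a job script.
# report_sge_env and the "<not set>" text are hypothetical helper choices.
report_sge_env() {
    local name
    for name in JOB_ID JOB_NAME HOSTNAME NHOSTS NQUEUES NSLOTS \
                SGE_O_HOST SGE_O_WORKDIR SGE_TASK_ID; do
        # ${!name} is bash indirect expansion: the value of the variable
        # whose name is stored in $name.
        printf '%s=%s\n' "$name" "${!name:-<not set>}"
    done
}

report_sge_env
```

Run outside the batch system, most of these lines show "<not set>". Inside a job, the output lands in the file named by the -o option.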