High Performance Computing

SGE commands, queues and resources

Commands

To query a job use the qstat command:

> qstat
job-ID  prior    name     user   state  submit/start at      queue              slots  ja-t
698     0.00000  test.sh  egzcb  qw     07/12/2008 08:57:35                     1
This shows that the job is queued and waiting (qw) and has been given a Job ID of 698. Later on it will be running:
~/benchmarks> qstat
job-ID  prior    name     user   state  submit/start at      queue              slots  ja-t
698     0.55500  test.sh  egzcb  r      07/12/2008 08:57:44  serial.q@compE000
This shows that the job was accepted by the serial.q queue and is actually running (r) on node compE000.
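
When scripting around qstat, the state column can be picked out of a listing like the ones above. The following is a minimal sketch, assuming the column layout shown in these examples (Job ID in the first column, state in the fifth). The name job_state is a hypothetical helper, not an SGE command.

```shell
# Print the state (e.g. qw or r) of one job from qstat output read on stdin.
# Assumption: the Job ID is in column 1 and the state in column 5,
# as in the listings above.
job_state() {
    awk -v id="$1" '$1 == id { print $5 }'
}

# Typical use on the cluster (commented out so the sketch runs without SGE):
# qstat | job_state 698
```

If the job is not listed, the function prints nothing, which matches the behaviour of qstat once a job has finished.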

If there are a lot of jobs on the cluster, the output from qstat can be difficult to read. You can restrict the output to your own jobs with the '-u' option, for example:

> qstat -u egzcb
This will show only jobs owned by user egzcb.

Once the job has finished it disappears from the qstat output. By default, the standard output and standard error from a job are written to files named after the job script, with .o and .e appended respectively, followed by the Job ID number. These locations can be changed with the -o and -e options to the qsub command. To send the error and output to the same file, use the -j y option. For the example above, the output file is as follows:

> cat test.sh.o698
This script is running on node compE000
The date is
Sat Jul 12 08:57:44 BST 2008
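
The default naming rule described above can be sketched as a small shell function. This is an illustration only: default_job_files is a hypothetical helper name, and it assumes the job name is taken from the script's file name without any directory part.

```shell
# Print the default stdout and stderr file names for a job, following the
# <script>.o<jobid> and <script>.e<jobid> pattern described above.
# Assumption: only the file name of the script is used, not its directory.
default_job_files() {
    script=$(basename "$1")
    printf '%s.o%s\n%s.e%s\n' "$script" "$2" "$script" "$2"
}

default_job_files test.sh 698

# Overriding the defaults at submission time (commented out so the sketch
# runs without SGE; test.log is a hypothetical file name):
# qsub -j y -o test.log test.sh
```

For the job in the example this prints test.sh.o698 and test.sh.e698, one per line.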

If you want to remove a queued or running job from the job queue, use the qdel command followed by the Job ID number, e.g.:

> qsub test.sh
Your job 700 ("test.sh") has been submitted
> qdel 700
egzcb has deleted job 700
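
In a script it can be convenient to capture the Job ID at submission so that it can be passed to qdel later. The following is a minimal sketch, assuming qsub prints its confirmation in the wording shown above. The name parse_job_id is a hypothetical helper, not part of SGE.

```shell
# Extract the numeric Job ID from the confirmation line printed by qsub.
# Assumption: the line reads 'Your job <id> ("<name>") has been submitted',
# as in the example above.
parse_job_id() {
    sed -n 's/^Your job \([0-9][0-9]*\) .*/\1/p'
}

# On the cluster (commented out so the sketch runs without SGE):
# jobid=$(qsub test.sh | parse_job_id)
# qdel "$jobid"
```

Lines that do not match the expected wording produce no output, so an empty result indicates that the submission message was not recognised.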

Queues and Resources

The Nottingham cluster has a number of queues installed, but users do not usually have to select which queue to use, as this is done automatically by the scheduler. Before submitting jobs, users are advised to run the 'jupiteruse' script to check which towers are most heavily used. A tower is selected with the 'module switch' command (see the Environments section for further details).

It is always advisable to specify the maximum time your job is likely to take, as this helps the scheduler run jobs more efficiently. For example, to submit a job which requires 2 hours to complete, submit with:

qsub -l h_rt=7200 myjob.sh
It is possible to specify a time limit when using the wrapper scripts by setting the QSUB_OPTIONS environment variable, e.g.:
export QSUB_OPTIONS="-l h_rt=3600"
ompisub myprogram
Times can be specified as seconds, or in an Hours:Minutes:Seconds format. Note that if your job exceeds the time limit specified, it will be immediately terminated by SGE.
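
The two time formats are equivalent, and converting between them is simple arithmetic. The following sketch converts a number of seconds to Hours:Minutes:Seconds; to_hms is a hypothetical helper name used only for illustration.

```shell
# Convert a number of seconds into Hours:Minutes:Seconds.
to_hms() {
    printf '%d:%02d:%02d\n' $(( $1 / 3600 )) $(( ($1 % 3600) / 60 )) $(( $1 % 60 ))
}

to_hms 7200    # prints 2:00:00
to_hms 3600    # prints 1:00:00

# On the cluster (commented out so the sketch runs without SGE):
# qsub -l h_rt="$(to_hms 7200)" myjob.sh
```

The 2 hour limit from the earlier example can therefore be written either as h_rt=7200 or as h_rt=2:00:00.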

Use of the time resource is essential to take advantage of short queues which are available on towers E and C, which both have compute nodes reserved for jobs taking less than 12 hours. Tower E also has compute nodes available for jobs taking less than 1 hour.

Towers B and E have some nodes with larger amounts of memory. These can be requested by specifying a resource requirement, which SGE then uses to select suitable compute nodes for the job. For example, on tower B most nodes have 4GB and six have 8GB. These can be requested as follows:

qsub -l 4G=true,h_rt=24:00:00 myjob.sh
qsub -l 8G=true,h_rt=24:00:00 myjob.sh
Note that these examples also include a time limit of 24 hours, in addition to the memory specification.

For tower E, the relevant resource is 'bigmem', e.g. (with a time limit of 72 hours also included):

qsub -l bigmem=true,h_rt=72:00:00 myjob.sh
This should only be used for jobs which demand large amounts of memory (> 2GB per process), as the number of high memory nodes is limited.
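
The memory and time requests above all follow the same pattern, so the argument to the -l option can be assembled by a small function. This is a sketch only: resource_request is a hypothetical helper name, and it assumes the resource names given above (4G, 8G, bigmem) and a time limit expressed in whole hours.

```shell
# Build the argument for "qsub -l" from a boolean resource name and a
# time limit in whole hours.
# Assumption: the resource names are those listed above (4G, 8G, bigmem).
resource_request() {
    printf '%s=true,h_rt=%d:00:00\n' "$1" "$2"
}

resource_request bigmem 72    # prints bigmem=true,h_rt=72:00:00

# On the cluster (commented out so the sketch runs without SGE):
# qsub -l "$(resource_request 8G 24)" myjob.sh
```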