TORQUE

From ArchWiki
Revision as of 23:23, 14 June 2011 by Graysky


Torque is an open source resource manager providing control over batch jobs and distributed compute nodes. With it, one can set up a home or small office Linux cluster and queue jobs. A cluster consists of one head node and one or more compute nodes. The head node runs the torque-server daemon and a scheduler daemon; the compute nodes run the torque-client daemon.

Installation

Download and build the torque package from the AUR.

Setup

Must Haves

Make sure that /etc/hosts.allow on all PCs you plan to use for your cluster is set up to accept traffic from "trusted" LAN IP addresses. For example, a LAN based in the 192.168.0.* IP addresses could use the following:

tcp: 192.168.0.

Server (Head Node) Configuration

Follow these steps on the Head Node/Scheduler.

Edit /var/spool/torque/server_name to name the head node. It is recommended to use the machine's own hostname for simplicity's sake.

Edit /var/spool/torque/server_priv/nodes, adding all compute nodes. Again, it is recommended to match the hostname(s) of the machines on your LAN. The syntax is HOSTNAME np=x gpus=y properties, where:

  • HOSTNAME=the hostname of the machine
  • np=number of processors
  • gpus=number of gpus
  • properties=arbitrary labels for the node, which jobs can request by name

Only the hostname is required; all other fields are optional.

Example:

mars np=4
phobos np=2
deimos np=2
Note: One can run both the server and client on the same box.
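
The entry format above can be produced mechanically. The following is a minimal sketch only: nodes_line is a hypothetical helper name, not something shipped with Torque, and it covers just the HOSTNAME and np fields.

```shell
#!/bin/sh
# nodes_line is a hypothetical helper, not part of Torque.
# It prints one nodes-file entry in the form "HOSTNAME np=x".
nodes_line() {
    host=$1   # hostname of the compute node
    np=$2     # number of processors to advertise
    printf '%s np=%s\n' "$host" "$np"
}

# Example, run on a compute node (nproc is from GNU coreutils):
#   nodes_line "$(hostname)" "$(nproc)"
```

The printed line would still need to be added by hand to the nodes file on the head node.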

Start the server:

# rc.d start torque-server

Then, as root, set the options in qmgr. A minimal set is provided here. Adjust the first line, substituting "mars" with the hostname entered in /var/spool/torque/server_name:

set server acl_hosts = mars
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set queue batch queue_type = Execution
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
set server scheduling = True
set server default_queue = batch
set server mom_job_sync = True
set server keep_completed = 300

Additional options can be specified as well. This is just a bare-bones set.
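
One way to apply such a set non-interactively is to keep the directives in a text file and pass them to qmgr one at a time with its -c option. This is a sketch under assumptions: apply_qmgr_settings, the QMGR override variable, and the file path in the usage comment are all hypothetical names introduced here for illustration.

```shell
#!/bin/sh
# apply_qmgr_settings is a hypothetical helper, not part of Torque.
# It reads qmgr directives from a file, one per line, and passes each
# to "qmgr -c". Blank lines and lines starting with "#" are skipped.
# QMGR may be set to another command name (an assumption made here so
# the function can be dry-run on a machine without a server).
apply_qmgr_settings() {
    file=$1
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            ''|'#'*) continue ;;
        esac
        # Keep the command from consuming the rest of the file on stdin.
        "${QMGR:-qmgr}" -c "$line" < /dev/null || return 1
    done < "$file"
}

# Example, as root on the head node (the path is hypothetical):
#   apply_qmgr_settings /root/torque-settings.txt
#   qmgr -c 'print server'    # show the resulting configuration
```

The function stops at the first directive that qmgr rejects, so a typo in the file does not go unnoticed.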

Client (Compute Node) Configuration

Follow these steps on each compute node in your cluster.

Note: If running both the server and client on the same box, be sure to complete these steps on that machine as well as on the pure clients in your cluster.

Edit /var/spool/torque/mom_priv/config to contain some basic info identifying the server:

$pbsserver      mars          # note: this is the hostname of the headnode
$logevent       255           # bitmap of which events to log

Restart the server and client(s)

That should be it. Restart the server on the head node and start the client on each compute node:

# rc.d restart torque-server
# rc.d start torque-client 

Verifying Cluster Status

To check the status of your cluster, issue the following:

$ pbsnodes -a

Each node that is up should report a state of free, indicating that it is ready to receive jobs. A node that is not working will report a state of down.

Example output:

mars
     state = free
     np = 4
     ntype = cluster
     status = rectime=1308093495,varattr=,jobs=,state=free,netload=739590301,gres=,loadave=0.14,ncpus=4,physmem=4058904kb,availmem=3281868kb,totmem=4058904kb,idletime=4747,nusers=1,nsessions=6,sessions=16758 31596 31612 31616 31651 32147,uname=Linux simplicity 2.6.39-ck #1 SMP PREEMPT Sat Jun 11 06:20:38 EDT 2011 x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0

phobos
     state = free
     np = 2
     ntype = cluster
     status = rectime=1308093460,varattr=,jobs=,state=free,netload=1723634,gres=,loadave=0.13,ncpus=2,physmem=4019704kb,availmem=5822968kb,totmem=6116852kb,idletime=2809,nusers=2,nsessions=5,sessions=1544 1655 1683 1696 2342,uname=Linux mythtv 2.6.37-ck #1 SMP PREEMPT Sun Apr 3 17:16:35 EDT 2011 x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0
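
On a cluster with more than a few nodes, the full listing can be condensed to one line per node. The sketch below is illustrative only: node_states is a hypothetical helper name, and it assumes the layout shown in the example output, where node names start in the first column and attribute lines are indented.

```shell
#!/bin/sh
# node_states is a hypothetical helper, not part of Torque.
# It reads "pbsnodes -a" style output on stdin and prints one
# "NODE STATE" pair per node.
node_states() {
    awk '
        # An unindented, non-empty line carries the node name.
        /^[^ \t]/ { node = $1 }
        # An indented "state = ..." line carries that node state.
        $1 == "state" && $2 == "=" { print node, $3 }
    '
}

# Example:
#   pbsnodes -a | node_states
```

The long status line is ignored because its first field is "status", not "state", so the state=free fragment embedded in it is never matched.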