Torque is an open source resource manager providing control over batch jobs and distributed compute nodes. With it, one can set up a home or small-office Linux cluster and queue jobs across it. A cluster consists of one head node and many compute nodes. The head node runs the torque-server daemon and a scheduler daemon; the compute nodes run the torque-client daemon.
Download and build the Torque package from the AUR.
Make sure that /etc/hosts.allow on all PCs you plan to use for your cluster is set up to accept traffic from trusted LAN IP addresses (for example, a LAN using the 192.168.0.* address range).
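A minimal /etc/hosts.allow rule for that range might look like this (a sketch; adjust the prefix to your own subnet):

```
ALL: 192.168.0.
```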
Server (Head Node) Configuration
Follow these steps on the Head Node/Scheduler.
Edit /var/spool/torque/server_priv/nodes, adding all compute nodes. Again, it is recommended to match the hostname(s) of the machines on your LAN. The syntax is: HOSTNAME np=x gpus=y properties
- HOSTNAME=the hostname of the machine
- np=number of processors
- gpus=number of gpus
- properties=comments you wish to add
Only the hostname is required; all other fields are optional.
mars np=4
phobos np=2
deimos np=2
Start the server and set up its options. A minimal set is provided here. Adjust the first line, substituting "mars" with the hostname entered in /var/spool/torque/server_priv/nodes:
# rc.d start torque-server
set server acl_hosts = mars
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set queue batch queue_type = Execution
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
set server scheduling = True
set server default_queue = batch
set server mom_job_sync = True
set server keep_completed = 300
Additional options can be specified as well. This is just a bare-bones set.
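The settings above are qmgr directives. One way to apply them, once the server daemon is running, is to save them to a file and feed it to qmgr as root (a sketch; torque.setup is an assumed filename, not part of the package):

```shell
qmgr < torque.setup      # load the directives listed above
qmgr -c 'print server'   # verify the resulting configuration
```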
Client (Compute Node) Configuration
Follow these steps on each compute node in your cluster.
Edit /var/spool/torque/mom_priv/config to contain some basic info identifying the server:
$pbsserver mars    # note: this is the hostname of the headnode
$logevent 255      # bitmap of which events to log
Restart the server and client(s)
That should be it. Restart the daemons so the changes take effect:
# rc.d restart torque-server
# rc.d start torque-client
Verifying Cluster Status
To check the status of your cluster, issue the following:
$ pbsnodes -a
Each node that is up should indicate that it is ready to receive jobs, reporting a state of free. If a node is not working, it will report a state of down.
mars
     state = free
     np = 4
     ntype = cluster
     status = rectime=1308093495,varattr=,jobs=,state=free,netload=739590301,gres=,loadave=0.14,ncpus=4,physmem=4058904kb,availmem=3281868kb,totmem=4058904kb,idletime=4747,nusers=1,nsessions=6,sessions=16758 31596 31612 31616 31651 32147,uname=Linux simplicity 2.6.39-ck #1 SMP PREEMPT Sat Jun 11 06:20:38 EDT 2011 x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0

phobos
     state = free
     np = 2
     ntype = cluster
     status = rectime=1308093460,varattr=,jobs=,state=free,netload=1723634,gres=,loadave=0.13,ncpus=2,physmem=4019704kb,availmem=5822968kb,totmem=6116852kb,idletime=2809,nusers=2,nsessions=5,sessions=1544 1655 1683 1696 2342,uname=Linux mythtv 2.6.37-ck #1 SMP PREEMPT Sun Apr 3 17:16:35 EDT 2011 x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0
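Once all nodes report free, a quick end-to-end check is to submit a trivial job as a regular user and watch it run (a sketch; qsub and qstat are the standard Torque client commands):

```shell
echo "sleep 30" | qsub   # submit a trivial job to the default queue
qstat                    # the job should appear as running, then complete
```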