Adding new nodes to an existing cluster
The following steps apply if you are adding new nodes to your cluster and the new nodes will be cloned from your existing node image.
First, edit /act/etc/act_nodes.conf and add your new node definitions below the existing node definitions. If you do not already have these definitions, they can be provided by ACT support.
Next, edit /act/etc/act_util.conf:
$ vi /act/etc/act_util.conf
Look for the [node] section:
[node]
type=range
start=1
end=10
The idea is to increase the end of any range value by the number of nodes you are adding. For example, if you had 10 nodes and are adding 8 more, change '10' to '18' (see the example after the following list).
The lines to look for are as follows:
end=
dev[eth0]_ipend=
dev[ipmi]_ipend=
If you have InfiniBand:
dev[ib0]_ipend=
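Putting this together, with 10 existing nodes and 8 new ones the edited [node] section would look something like the sketch below. The ipend values shown are hypothetical placeholders; keep your cluster's actual addresses and extend each range by the number of nodes added:
[node]
type=range
start=1
end=18
dev[eth0]_ipend=10.1.1.18
dev[ipmi]_ipend=10.2.1.18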
Regenerate all the appropriate configuration files by running this command:
$ /act/bin/act_cfgfile --hosts --ssh --cloner --dhcp --prefix=/
Restart DHCP since new hosts were added:
$ service dhcpd restart
Copy the new hosts and known_hosts files to all the nodes:
$ /act/bin/act_cp -a /etc/hosts
$ /act/bin/act_cp -a /etc/ssh/ssh_known_hosts2
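As a quick sanity check, confirm that the new hostnames now resolve on the head node (node11 is from the example range; substitute one of your new node names):
$ getent hosts node11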
Log in to node01 as root and run the following to update your compute node image:
$ /act/cloner/bin/cloner --server=head --image=node
Back on the head node, run the following, replacing node11-node18 with the names and range of your new nodes:
$ /act/bin/act_netboot -r node11-node18 --set=cloner3
When the new nodes are turned on, they will network boot, install their OS, and reboot when complete. Once the new nodes are up and accessible, continue with the next steps.
Synchronize the clocks on the entire cluster:
$ act_exec -a 'service ntpd stop; ntpdate 1.centos.pool.ntp.org; hwclock --systohc; service ntpd start'
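To spot-check that the clocks now agree, you can compare the time reported by every node:
$ act_exec -a date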
The following commands use the information in act_util.conf to set the IPMI IP address and network settings on the new nodes. Replace node11-node18 with the names and range of your new nodes:
$ act_exec -r node11-node18 "service ipmi start"
$ act_ipmi_netcfg -r node11-node18
$ act_ipmi_netcfg -a --dump_dhcp > /etc/dhcpd.d/ipmi.conf
$ service dhcpd restart
$ act_exec -r node11-node18 "service ipmi stop"
$ act_ipmi_log -a setdate
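Optionally, verify from the head node that the BMCs respond at their newly assigned addresses (substitute an actual IPMI address from the range defined in act_util.conf):
$ ping -c 1 <new node IPMI address>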
If you are using SGE (Sun Grid Engine) for your job scheduler
To add the new compute nodes to the SGE queueing system, run the following commands and follow the directions with each step:
$ qconf -mhgrp @allhosts
- add an entry for each new host that you are adding
$ qconf -ae <hostname>
- add an exec host entry for each new host that you are adding
- this opens a file editor
- set 'hostname' to the new hostname
- set 'complex_values' to 'slots=#' where # is the number of CPU cores in that system
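For reference, the relevant fields in the editor for the first new node might look like the following; the 16 slots shown is an assumption, so use the actual number of CPU cores in the node:
hostname        node11
complex_values  slots=16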
$ for i in `act_nodenames -r node11-node18`; do qconf -ah $i; done
- add an administrative host entry for each new host that you are adding
$ for i in `act_nodenames -r node11-node18`; do qconf -as $i; done
- add a submit host entry for each new host that you are adding
Each host must have a configuration file added for it. We can create a config file for each of the new nodes from one of the already configured nodes:
$ qconf -sconf <existing hostname> > <new hostname>
So for our example above we can do the following:
$ mkdir /tmp/sge; cd /tmp/sge
$ for i in `act_nodenames -r node11-node18`; do qconf -sconf node01 > $i; done
$ for i in `act_nodenames -r node11-node18`; do qconf -Aconf $i; done
(Note: this creates a file for each hostname in the current working directory, /tmp/sge.)
If you are using Torque for your job scheduler
To add the new compute nodes to the Torque scheduler, edit the nodes list:
$ vi /var/spool/torque/server_priv/nodes
- Add an entry line for each new compute node
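The new entries might look like the following, one line per node continuing through node18; np is the number of processor cores, and 16 here is an assumption to match to your hardware:
node11 np=16
node12 np=16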
Next, restart the pbs_server and pbs_sched services:
$ /etc/init.d/pbs_server restart
$ /etc/init.d/pbs_sched restart
If you are using SLURM for your job scheduler
To add the new compute nodes to SLURM, run the following commands and follow the directions with each step:
For GPU nodes, create the file gres.conf in /act/slurm
$ cd /act/slurm
$ vi gres.conf
Add a line for each type of GPU node, for example:
NodeName=node[17-18] Name=gpu Type=kepler File=/dev/nvidia0
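If a node has more than one GPU, File= accepts a device range; for example, with two GPUs per node:
NodeName=node[17-18] Name=gpu Type=kepler File=/dev/nvidia[0-1]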
Then, for the GPU and all other new nodes, add them to slurm.conf:
$ vi /act/slurm/slurm.conf
At the bottom, extend the NodeName= line to include the additional nodes, or add a new line if the new nodes' hardware is different.
NodeName=node[01-16] CPUs=16 RealMemory=128000 Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 State=UNKNOWN
NodeName=node[17-18] CPUs=16 RealMemory=128000 Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 Gres=gpu:kepler:1 State=UNKNOWN
PartitionName=batch Nodes=node[01-16] Default=YES MaxTime=30-0:00:00 State=UP QOS=batch DefMemPerCPU=8000
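If the new nodes should be usable in the existing partition, also extend the partition's Nodes= range to include them. For this example:
PartitionName=batch Nodes=node[01-18] Default=YES MaxTime=30-0:00:00 State=UP QOS=batch DefMemPerCPU=8000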
Then from the head node, restart the services.
CentOS/EL6
$ service slurmdbd restart
$ chkconfig --add slurmdbd
$ service slurmctld restart
$ scontrol reconfigure
CentOS/EL7
$ systemctl restart slurmdbd
$ systemctl restart slurmctld
$ scontrol reconfigure
Enable and start the slurm daemon on the new compute nodes.
CentOS/EL6
$ act_exec -r node11-node18 service slurm start
$ act_exec -r node11-node18 chkconfig slurm on
CentOS/EL7
$ act_exec -r node11-node18 systemctl start slurmd.service
$ act_exec -r node11-node18 systemctl enable slurmd.service
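Finally, from the head node you can confirm that the new nodes have registered with the SLURM controller (node11 is from the example range):
$ sinfo
$ scontrol show node node11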