Adding new nodes to an existing cluster
The following steps apply if you are adding new nodes to your cluster and the new nodes will be cloned from your existing node image.
First, edit /act/etc/act_nodes.conf and add your new node definitions below the existing node definitions. If you do not already have these definitions, they can be provided by ACT support.
Next, edit /act/etc/act_util.conf:
$ vi /act/etc/act_util.conf
Look for the [node] section:
[node]
type=range
start=1
end=10
The idea is to increase the end of any range value by the number of nodes you are adding. For example, if you had 10 nodes and are adding 8 more, change '10' to '18' (see the example after the following list).
The lines to look for are as follows:
end=
dev[eth0]_ipend=
dev[ipmi]_ipend=
If you have InfiniBand:
dev[ib0]_ipend=
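Putting this together, with 10 existing nodes and 8 new ones the edited [node] section would look something like the sketch below. The ipend values shown are hypothetical placeholders; keep your cluster's actual addresses and extend each range by the number of nodes added:
[node]
type=range
start=1
end=18
dev[eth0]_ipend=10.1.1.18
dev[ipmi]_ipend=10.2.1.18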
Regenerate all the appropriate configuration files by running this command:
$ /act/bin/act_cfgfile --hosts --ssh --cloner --dhcp --prefix=/
Restart DHCP since new hosts were added:
$ service dhcpd restart
Copy the new hosts and known_hosts files to all the nodes:
$ /act/bin/act_cp -a /etc/hosts
$ /act/bin/act_cp -a /etc/ssh/ssh_known_hosts2
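As a quick sanity check, confirm that the new hostnames now resolve on the head node (node11 is from the example range; substitute one of your new node names):
$ getent hosts node11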
Log in to node01 as root and run the following to update your compute node image:
$ /act/cloner/bin/cloner --server=head --image=node
Back on the head node, run the following, replacing node11-node18 with the names and range of your new nodes:
$ /act/bin/act_netboot -r node11-node18 --set=cloner3
When the new nodes are turned on, they will network boot, install their OS, and reboot when complete. Once the new nodes are up and accessible, continue with the next steps.
Synchronize the clocks on the entire cluster:
$ act_exec -a 'service ntpd stop; ntpdate 1.centos.pool.ntp.org; hwclock --systohc; service ntpd start'
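To spot-check that the clocks now agree, you can compare the time reported by every node:
$ act_exec -a date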
The following commands use the information in act_util.conf to set the IPMI IP address and network settings on the new nodes. Replace node11-node18 with the names and range of your new nodes:
$ act_exec -r node11-node18 "service ipmi start"
$ act_ipmi_netcfg -r node11-node18
$ act_ipmi_netcfg -a --dump_dhcp > /etc/dhcpd.d/ipmi.conf
$ service dhcpd restart
$ act_exec -r node11-node18 "service ipmi stop"
$ act_ipmi_log -a setdate
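Optionally, verify from the head node that the BMCs respond at their newly assigned addresses (substitute an actual IPMI address from the range defined in act_util.conf):
$ ping -c 1 <new node IPMI address>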
If you are using SGE (Sun Grid Engine) for your job scheduler
To add the new compute nodes to the SGE queueing system, run the following commands and follow the directions with each step:
$ qconf -mhgrp @allhosts
- add an entry for each new host that you are adding
$ qconf -ae <hostname>
- add an exec host entry for each new host that you are adding
- this opens a file editor
- set 'hostname' to the new hostname
- set 'complex_values' to 'slots=#' where # is the number of CPU cores in that system
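For reference, the relevant fields in the editor for the first new node might look like the following; the 16 slots shown is an assumption, so use the actual number of CPU cores in the node:
hostname        node11
complex_values  slots=16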
$ for i in `act_nodenames -r node11-node18`; do qconf -ah $i; done
- add an administrative host entry for each new host that you are adding
$ for i in `act_nodenames -r node11-node18`; do qconf -as $i; done
- add a submit host entry for each new host that you are adding
Each host must have a configuration file added for it. We can create a config file for each of the new nodes from one of the already configured nodes:
$ qconf -sconf <existing hostname> > <new hostname>
So for our example above we can do the following:
$ mkdir /tmp/sge; cd /tmp/sge
$ for i in `act_nodenames -r node11-node18`; do qconf -sconf node01 > $i; done
$ for i in `act_nodenames -r node11-node18`; do qconf -Aconf $i; done
(Note: this creates a file for each hostname in the current working directory, /tmp/sge.)
If you are using Torque for your job scheduler
To add the new compute nodes to the Torque scheduler, edit the nodes list:
$ vi /var/spool/torque/server_priv/nodes
- Add an entry line for each new compute node
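The new entries might look like the following, one line per node continuing through node18; np is the number of processor cores, and 16 here is an assumption to match to your hardware:
node11 np=16
node12 np=16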
Next, restart the pbs_server and pbs_sched services:
$ /etc/init.d/pbs_server restart
$ /etc/init.d/pbs_sched restart
If you are using SLURM for your job scheduler
To add the new compute nodes to SLURM, run the following commands and follow the directions with each step:
For GPU nodes, create the file gres.conf in /act/slurm
$ cd /act/slurm
$ vi gres.conf
Add a line for each type of GPU node, for example:
NodeName=node[17-18] Name=gpu Type=kepler File=/dev/nvidia0
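If a node has more than one GPU, File= accepts a device range; for example, with two GPUs per node:
NodeName=node[17-18] Name=gpu Type=kepler File=/dev/nvidia[0-1]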
Then, for the GPU and all other new nodes, add them to slurm.conf:
$ vi /act/slurm/slurm.conf
At the bottom, extend the NodeName= line to include the additional nodes, or add a new line if the new nodes' hardware is different.
NodeName=node[01-16] CPUs=16 RealMemory=128000 Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 State=UNKNOWN
NodeName=node[17-18] CPUs=16 RealMemory=128000 Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 Gres=gpu:kepler:1 State=UNKNOWN
PartitionName=batch Nodes=node[01-16] Default=YES MaxTime=30-0:00:00 State=UP QOS=batch DefMemPerCPU=8000
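If the new nodes should be usable in the existing partition, also extend the partition's Nodes= range to include them. For this example:
PartitionName=batch Nodes=node[01-18] Default=YES MaxTime=30-0:00:00 State=UP QOS=batch DefMemPerCPU=8000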
Then from the head node, restart the services.
CentOS/EL6
$ service slurmdbd restart
$ chkconfig --add slurmdbd
$ service slurmctld restart
$ scontrol reconfigure
CentOS/EL7
$ systemctl restart slurmdbd
$ systemctl restart slurmctld
$ scontrol reconfigure
Enable and start the slurm daemon on the new compute nodes.
CentOS/EL6
$ act_exec -r node11-node18 service slurm start
$ act_exec -r node11-node18 chkconfig slurm on
CentOS/EL7
$ act_exec -r node11-node18 systemctl start slurmd.service
$ act_exec -r node11-node18 systemctl enable slurmd.service
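Finally, from the head node you can confirm that the new nodes have registered with the SLURM controller (node11 is from the example range):
$ sinfo
$ scontrol show node node11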