Clusters built by Advanced Clustering Technologies let you set a compute node to boot into our Breakin utility, which stress tests the machine. This is an easy way to check a node for hardware errors.
To set a compute node to boot to Breakin from the head node:
$ act_netboot -n <node hostname> --set=breakin
To confirm that the breakin image has been set:
$ act_netboot -n <node hostname> --list
Now reboot the machine so that it automatically loads Breakin. You can do this from the head node as well:
$ act_powerctl -n <node hostname> reboot
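The three steps above can be applied to several nodes in one pass. The following is a minimal sketch, not part of the ACT tooling: the function names `run` and `breakin_stage` and the `DRY_RUN` variable are hypothetical, and the only cluster commands used are the `act_netboot` and `act_powerctl` invocations shown above. With `DRY_RUN=1` (the default assumed here) the commands are only printed so they can be reviewed before anything is rebooted.

```shell
#!/bin/sh
# Hypothetical helper: stage one or more nodes to boot into Breakin.
# DRY_RUN=1 (assumed default) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# For each hostname given: set the breakin image, list the
# current setting, then reboot the node.
breakin_stage() {
    for node in "$@"; do
        run act_netboot -n "$node" --set=breakin
        run act_netboot -n "$node" --list
        run act_powerctl -n "$node" reboot
    done
}
```

Run from the head node, `breakin_stage node01 node02` prints the six commands for the two example hostnames. Setting `DRY_RUN=0` before calling the function executes them instead.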
Once the node is in Breakin, we recommend letting it run for 12-24 hours to thoroughly stress test the hardware. Any errors found during the run are displayed in red. If you are unsure what an error means, you can send it to firstname.lastname@example.org.
Note: failid errors are common and can be ignored.
You can also SSH into the compute node while Breakin is running; you will see the same output as you would on an attached monitor. By default the user information is:
$ ssh ssh@<node hostname>
Below is an example of Breakin running and showing hard drive SMART errors on a node. The hdhealth tests still pass, however, because SMART errors more often indicate that a drive will eventually fail and need to be replaced.
When done, you can set the node back to localboot and reboot the machine:
$ act_netboot -n <node hostname> --set=localboot
$ act_powerctl -n <node hostname> reboot
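When several nodes have finished testing, the same two commands can be generated for each of them. This is a minimal sketch under stated assumptions: the function name `breakin_restore_cmds` is hypothetical, it only prints the commands shown above without running them, and it assumes hostnames contain no spaces or shell metacharacters.

```shell
#!/bin/sh
# Hypothetical helper: print the commands that return each node
# to localboot and reboot it. Nothing is executed here.
breakin_restore_cmds() {
    for node in "$@"; do
        printf 'act_netboot -n %s --set=localboot\n' "$node"
        printf 'act_powerctl -n %s reboot\n' "$node"
    done
}
```

Review the output of `breakin_restore_cmds node01 node02` first, then pipe it to `sh` on the head node to run it.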