Taking Compute Nodes Down for Maintenance
Before taking a compute node down for any reason, remove it from every job queue it belongs to. Otherwise, a node that comes back up temporarily may start new jobs, only to be shut down again, killing those users' jobs. Here's how to safely pull a node out of service in the three schedulers our customers use most often.
Grid Engine:
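Use qmod -d to disable or qmod -e to enable a queue instance, given as queuename@hostname. Use * as the queue name to match every queue on a host, and quote the argument so the shell does not try to expand the wildcard. Examples:
Disable:
qmod -d '*@node01'
Enable:
qmod -e '*@node01'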
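When several nodes go down together, the same qmod call can be wrapped in a loop. The following is a minimal sketch, not a Grid Engine feature: the function name sge_set_hosts and the APPLY variable are made up for illustration. By default it only prints the commands it would run, so the list can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper: disable (-d) or enable (-e) every queue instance
# on each listed host. Prints the qmod commands by default; set APPLY=1
# in the environment to actually run them.
sge_set_hosts() {
    flag=$1
    shift
    case "$flag" in
        -d|-e) ;;
        *) echo "usage: sge_set_hosts -d|-e host..." >&2; return 2 ;;
    esac
    for host in "$@"; do
        if [ "${APPLY:-0}" = 1 ]; then
            # Quoted so the shell passes the wildcard through to qmod.
            qmod "$flag" "*@$host"
        else
            echo "qmod $flag '*@$host'"
        fi
    done
}

# Dry run: show what would be executed for two nodes.
sge_set_hosts -d node01 node02
```

Once the printed commands look right, the same call can be repeated with APPLY=1 set.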
Slurm:
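Change the node state with scontrol update, specifying the node (or a node range) and the new state. DRAIN stops new jobs from being scheduled on the node while letting running jobs finish. You must provide a reason when draining a node.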
Disable:
scontrol update NodeName=node[02-04] State=DRAIN Reason="Cloning"
Enable:
scontrol update NodeName=node[02-04] State=RESUME
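A node that still has jobs running reports the state "draining" and only becomes "drained" once they finish, so it is worth checking before powering it off. The sketch below assumes sinfo is on the PATH; the function names and the POLL_SECONDS variable are hypothetical, chosen for illustration.

```shell
#!/bin/sh
# Succeeds when a state string, as printed by `sinfo -o %T`, shows the
# node has finished draining. Slurm may append a flag character to the
# state (for example "drained*" for a non-responding node), so only the
# prefix is matched.
node_is_drained() {
    case "$1" in
        drained*) return 0 ;;
        *) return 1 ;;
    esac
}

# Poll one node until it is drained. POLL_SECONDS is an assumed knob,
# defaulting to 30. There is no timeout, so interrupt it if needed.
wait_for_drain() {
    node=$1
    # -h: no header, -N: one line per node, -n: restrict to this node.
    # head keeps one line when the node belongs to several partitions.
    until node_is_drained "$(sinfo -h -N -n "$node" -o '%T' | head -n 1)"; do
        sleep "${POLL_SECONDS:-30}"
    done
}

# Example use after the DRAIN command:
#   wait_for_drain node02 && echo "node02 is safe to shut down"
```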
Torque:
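Use the pbsnodes command to change a node's availability in Torque. The -o option marks the node offline so that no new jobs are scheduled on it, and -r clears the offline mark.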
Disable:
pbsnodes -o node05
Enable:
pbsnodes -r node05
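To confirm that the offline mark took effect, the output of pbsnodes -l (which lists nodes that are down, offline, or in an unknown state) can be checked for the node. This is a rough sketch: the function name node_is_offline is hypothetical, and it assumes each output line has the node name in the first column and the state in the second.

```shell
#!/bin/sh
# Reads `pbsnodes -l` style lines ("name state") on stdin and succeeds
# if the named node is listed with a state containing "offline".
node_is_offline() {
    awk -v n="$1" '$1 == n && $2 ~ /offline/ { found = 1 }
                   END { exit !found }'
}

# Example use on a Torque server:
#   pbsnodes -l | node_is_offline node05 && echo "node05 is offline"
```

Note that an offline node may still be running jobs that started earlier, so this check only confirms that no new work will be scheduled there.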
Each of these schedulers offers many more control options for queues, hosts, and other objects. The commands above are a good starting point for maintaining individual nodes while keeping the rest of your cluster in production.
Get More Tech Tips
Visit the Advanced Clustering Technologies Knowledge Base for more tech tips from our HPC engineers.