Taking Compute Nodes Down for Maintenance

When you take a compute node down for any reason, it’s good practice to first remove it from any job queues it belongs to. Otherwise, a node that comes back up temporarily may start new jobs, only to be shut down again, killing those users’ jobs. Here’s how to safely pull a node out of service in the three schedulers our customers use most often.

Grid Engine:
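Use qmod -d to disable or qmod -e to enable a queue instance, given as queuename@hostname. You can use * as the queue name to match all queues on a host; quote it so the shell passes it to qmod unchanged. Disabling stops new jobs from being dispatched to the queue instance, while jobs already running there continue. Examples:
Disable:

qmod -d '*@node01'

Enable:

qmod -e '*@node01'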
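
When several hosts go down for maintenance at once, the same qmod calls can be wrapped in a small loop. The following is a minimal sketch, not an official Grid Engine tool: the function name sge_set_hosts and the DRY_RUN variable are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of Grid Engine): disable or enable every
# queue instance on each listed host with qmod.
# Set DRY_RUN=1 to print the qmod commands instead of running them.
sge_set_hosts() {
    action="$1"
    shift
    case "$action" in
        disable) flag=-d ;;
        enable)  flag=-e ;;
        *)
            echo "usage: sge_set_hosts disable|enable host..." >&2
            return 2
            ;;
    esac
    for host in "$@"; do
        # The wildcard is quoted so the shell hands it to qmod unexpanded.
        ${DRY_RUN:+echo} qmod "$flag" "*@$host"
    done
}
```

Calling sge_set_hosts disable node01 node02 with DRY_RUN=1 set prints the two qmod commands without touching the cluster, which is a convenient way to check the host list before running it for real.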

Slurm:
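Change the node state with scontrol update, specifying the node (or a range of nodes) and the new state. State=DRAIN stops new jobs from being scheduled on the node while allowing running jobs to finish. You must provide a reason when draining a node.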
Disable:

scontrol update NodeName=node[02-04] State=DRAIN Reason="Cloning"

Enable:

scontrol update NodeName=node[02-04] State=RESUME
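
Because a drain request without a reason is rejected, it can help to check for one before calling scontrol. The wrappers below are a sketch under that assumption; slurm_drain, slurm_resume, and DRY_RUN are hypothetical names, not Slurm commands.

```shell
#!/bin/sh
# Hypothetical wrappers (not part of Slurm) around the scontrol calls above.
# Set DRY_RUN=1 to print the scontrol commands instead of running them.
slurm_drain() {
    nodes="$1"
    reason="$2"
    # Refuse to continue without a node list and a reason.
    if [ -z "$nodes" ] || [ -z "$reason" ]; then
        echo "usage: slurm_drain nodelist reason" >&2
        return 2
    fi
    # Quoting keeps a range such as node[02-04] from being treated as a glob.
    ${DRY_RUN:+echo} scontrol update NodeName="$nodes" State=DRAIN Reason="$reason"
}

slurm_resume() {
    if [ -z "$1" ]; then
        echo "usage: slurm_resume nodelist" >&2
        return 2
    fi
    ${DRY_RUN:+echo} scontrol update NodeName="$1" State=RESUME
}
```

After draining, sinfo -R lists the affected nodes together with the reason that was recorded, which makes it easy to confirm the change and to see later why a node is out of service.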

Torque:
Use the pbsnodes command to mark a node offline (-o), so that no new jobs are scheduled on it, and to clear the offline flag (-r) when the node is ready to return to service.
Disable:

pbsnodes -o node05

Enable:

pbsnodes -r node05
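
To take several Torque nodes offline in one step and then confirm the result, the two pbsnodes calls can be combined. This is only a sketch: torque_offline, torque_online, and DRY_RUN are hypothetical names introduced here, not Torque commands.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Torque) around the pbsnodes calls above.
# Set DRY_RUN=1 to print the pbsnodes commands instead of running them.
torque_offline() {
    if [ "$#" -eq 0 ]; then
        echo "usage: torque_offline node..." >&2
        return 2
    fi
    # Mark every listed node offline; stop here if pbsnodes reports an error.
    ${DRY_RUN:+echo} pbsnodes -o "$@" || return
    # pbsnodes -l lists nodes that are currently down, offline, or unknown.
    ${DRY_RUN:+echo} pbsnodes -l
}

torque_online() {
    if [ "$#" -eq 0 ]; then
        echo "usage: torque_online node..." >&2
        return 2
    fi
    ${DRY_RUN:+echo} pbsnodes -r "$@"
}
```

Running torque_offline node05 node06 marks both nodes offline and then prints the list of unavailable nodes, so a mistyped node name shows up immediately.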

Each of these schedulers offers many more control options for queues, hosts, and other objects. The commands above are a good starting point for maintaining individual nodes while the rest of your cluster stays in production.

Get More Tech Tips
Visit the Advanced Clustering Technologies Knowledge Base for more tech tips from our HPC engineers.