
Taking Compute Nodes Down for Maintenance

Before taking a compute node down for any reason, remove it from any job queues it belongs to. Otherwise, a node that comes up temporarily during maintenance may start new jobs, only to be shut down again, killing the user’s job. Here’s how to safely pull a node out of service for the three most common schedulers our customers use.

Grid Engine:
Use qmod -d to disable or qmod -e to enable a queue instance, given as queuename@hostname. Use * as the queue name to match all queues on a host, and quote the pattern so the shell does not expand it. Examples:
Disable:

CODE
qmod -d '*@node01'

Enable:

CODE
qmod -e '*@node01'
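
To confirm that the disable took effect, it can help to list the queue instances on the host right after running qmod. The sketch below wraps both steps in a shell function; the name disable_host is a hypothetical helper for illustration, not a Grid Engine command, and it assumes qmod and qstat are on your PATH. Jobs already running on the host keep running; only new jobs are held back.

```shell
#!/bin/sh
# Sketch: disable every queue instance on a host, then show the result.
# disable_host is a hypothetical helper name, not part of Grid Engine.
disable_host() {
    host="$1"
    # Quote the pattern so the shell passes the * to qmod unexpanded.
    qmod -d "*@${host}" || return 1
    # List the queue instances on that host. A "d" in the states
    # column marks a disabled instance.
    qstat -f -q "*@${host}"
}

# Usage: disable_host node01
```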


Slurm:
Modify the state with scontrol, specifying the node and the new state. Setting the state to DRAIN lets running jobs finish while keeping new jobs off the node. You must provide a reason when disabling a node.
Disable:

CODE
scontrol update NodeName=node[02-04] State=DRAIN Reason="Cloning"

Enable:

CODE
scontrol update NodeName=node[02-04] State=RESUME
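
A draining node still runs its existing jobs, so it is worth waiting until the node reports as fully drained before powering it off. The following is a minimal sketch of such a check, assuming sinfo is on your PATH; wait_for_drain and its polling interval are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Sketch: poll a single node until Slurm reports it as drained.
# wait_for_drain is a hypothetical helper name, not a Slurm command.
wait_for_drain() {
    node="$1"
    interval="${2:-30}"   # seconds between checks (assumed default)
    while :; do
        # -h: no header, -N: one line per node, -n: restrict to this node,
        # %T: node state in extended form (e.g. draining, drained).
        state=$(sinfo -h -N -n "$node" -o "%T" | head -n 1)
        case "$state" in
            drained*)  echo "$node is drained"; return 0 ;;
            draining*) sleep "$interval" ;;
            *) echo "$node is in state '$state', not draining" >&2
               return 1 ;;
        esac
    done
}

# Usage: wait_for_drain node02 && echo "safe to start maintenance"
```

For a quick manual check, sinfo -R lists the nodes that are down or drained along with the reason you supplied.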


Torque:
Use the pbsnodes command to make a node unavailable or available in Torque: -o marks the node offline, and -r returns it to service.
Disable:

CODE
pbsnodes -o node05

Enable:

CODE
pbsnodes -r node05
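
To verify the change, pbsnodes -l lists the nodes that are currently down, offline, or in an unknown state. The sketch below checks that list for a given node; node_is_offline is a hypothetical helper name, and the parsing assumes the usual two-column output of node name followed by state.

```shell
#!/bin/sh
# Sketch: succeed if Torque lists the node as offline.
# node_is_offline is a hypothetical helper name, not a Torque command.
node_is_offline() {
    node="$1"
    # pbsnodes -l prints one line per down, offline, or unknown node:
    # the node name, then its state (which may be a comma-separated list).
    pbsnodes -l | awk -v n="$node" '
        $1 == n && $2 ~ /offline/ { found = 1 }
        END { exit !found }'
}

# Usage: node_is_offline node05 && echo "node05 is out of service"
```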

Each of these schedulers offers many more control options for queues, hosts, and other objects. The commands above are a good starting point for maintaining individual nodes while keeping the rest of your cluster in production.

Get More Tech Tips
Visit the Advanced Clustering Technologies Knowledge Base for more tech tips from our HPC engineers.
