
Power on/off cluster

This document is a template guide: each HPC system can have variations that change the order of the steps below.

Power Off

Check status

Verify the status of the compute nodes and note which nodes are up/down.

CODE
$ cv-power -g nodes status 

-g nodes refers to a node group containing all compute nodes. If you do not have this group, are unsure whether you do, or want to see a listing of available groups, run cv-nodenames -a --dump-groups for a full listing of all nodes and devices and their associated groups.
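If the status listing is long, a small text filter can help note which nodes are down. This is a hypothetical sketch: the "name state" line format below is an assumption, not the documented cv-power output, so adjust the awk pattern to match what your system actually prints.

```shell
# Hypothetical helper: list nodes whose state is not "on" from status
# output. The "name state" line format is an assumption; adapt it to
# the real cv-power output on your system.
list_down_nodes() { awk '$2 != "on" { print $1 }'; }

# Example with mocked-up status lines:
printf 'node01 on\nnode02 off\nnode03 on\n' | list_down_nodes
```

With the mocked-up input above, this prints node02.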

Turn off compute nodes

Power off all compute nodes. This example uses the "group" directive. To see a list of groups and nodes, run cv-nodenames -a --dump-groups.

CODE
cv-power -g nodes shutdown

Turn off non-compute, non-storage nodes

Typically, anything that is neither a compute node nor a storage node will be a login node or, in some cases, the head node (when the head node has no NFS storage mounts).

NOTE: Do not shut down the node you are currently logged into. If there is a single login node from which all other commands are run, it should be the last of these nodes to be shut down.

NOTE: Any system that uses dual mirrored drives should be properly shut down with the poweroff command. Other methods (e.g., cv-power or init 0) may not properly unmount the MDADM RAID, which can cause boot issues.

CODE
cv-power -n login01 shutdown
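Where the dual-mirrored-drive note applies, one option is to run poweroff on each login node over SSH instead. A minimal sketch, assuming working root SSH keys and example node names; DRY_RUN=echo prints the commands instead of running them.

```shell
# Sketch: run 'poweroff' over SSH on each login node so the MDADM RAID
# is unmounted cleanly. Node names are examples. DRY_RUN=echo prints
# the commands; set DRY_RUN= (empty) to actually power the nodes off.
DRY_RUN=echo
for node in login01 login02; do
    $DRY_RUN ssh "root@$node" poweroff
done
```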

Turn off storage nodes

Lastly, turn off the storage nodes. In some clusters this is the head node; in others it is storage01, or there are multiple storage nodes. This example uses the "group" directive. To see a list of groups and nodes, run cv-nodenames -a --dump-groups.

NOTE: Any system that uses dual mirrored drives should be properly shut down with the poweroff command. Other methods (e.g., cv-power or init 0) may not properly unmount the MDADM RAID, which can cause boot issues.

CODE
cv-power -g storage shutdown

Turn off the appliance

This must be done by logging into the appliance and running:

CODE
cv-power -A -n cv01 shutdown

or by manually pressing the power button.

Power on

To power everything back on, reverse the order:

Start the appliance

Push the button on the appliance - wait until it has booted.

Start the storage

This may be dedicated storage nodes, or the head node.

Push the button on the storage node(s) or run:

CODE
cv-power -g storage on

Wait until it has booted, then log in to the storage node(s) and check that everything is healthy and there are no obvious errors. Verify that the storage mounts are working:

  • df -h

  • For ZFS: zpool status

  • For BeeGFS: beegfs-check-servers
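The checks above can be combined into one small script. This is a sketch, not a ClusterVisor tool: the ZFS and BeeGFS commands are only run when installed, so it is safe on any node.

```shell
# Post-boot storage sanity checks (a sketch combining the steps above).
storage_checks() {
    df -h                                   # are the expected mounts there?
    if command -v zpool >/dev/null 2>&1; then
        zpool status -x                     # ZFS: "all pools are healthy" when OK
    fi
    if command -v beegfs-check-servers >/dev/null 2>&1; then
        beegfs-check-servers                # BeeGFS: are all servers reachable?
    fi
    echo "storage checks finished"
}
storage_checks
```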

Start the head and/or login nodes

Push the power button on the head node and wait until it has booted. Log in and check that everything is healthy and there are no obvious errors. Then do the same on the login node(s).

Start the nodes

CODE
cv-power -g nodes on 

That should power on all the compute nodes. If any have issues, you may need to press their power buttons manually.
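One way to confirm nodes are actually back is to ping each one. A sketch with example node names; DRY_RUN=echo prints the checks instead of running them.

```shell
# Sketch: confirm each compute node responds after power-on. Node names
# are examples; set DRY_RUN= (empty) to really run the pings.
DRY_RUN=echo
for node in node01 node02; do
    $DRY_RUN ping -c 1 -W 2 "$node"
done
```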

How to configure power on/off options

Each node can be configured to use one of three power methods: SSH, IPMI, or PDU.

Default

Ensure that the default method is selected.

Web GUI

Nodes and Devices → Nodes → Select the node to be edited → Edit → General → Default Power On/Off Method

[Screenshot: DefaultPower.png]

CLI

CODE
$ cv-conf -e node.node_name
power_method: ipmi

SSH

This method requires the node to be online and the Linux operating system to be working. SSH keys must be working.
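A quick way to confirm key-based root SSH before relying on this method: BatchMode makes ssh fail instead of prompting for a password. The node name is an example, and DRY_RUN=echo just prints the command.

```shell
# Sketch: test passwordless root SSH to a node (name is an example).
# With working keys the real command exits 0; without them it fails
# instead of prompting, thanks to BatchMode. Set DRY_RUN= to run it.
DRY_RUN=echo
$DRY_RUN ssh -o BatchMode=yes root@node01 true
```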

Web GUI

Users, Groups, and Roles → root User Trusted SSH Keys: Ensure the appliance, root, and any administrators have their keys listed.

[Screenshot: SSHKeys.png]

CLI

CODE
$ cv-conf -e config.ssh
<edit>
trusted_root_ssh_keys:
- label: Trusted Admin user name here
  public_key: ssh-ed25519 AAAAC3N...long_string_here...ASD user@host.cluster

IPMI

This method requires a working IPMI/BMC network that has been properly configured.

Web GUI

Nodes and Devices → Nodes → Select the node to be edited → Edit → Networking → IPMI devices

[Screenshot: IPMI.png]

CLI

CODE
$ cv-conf -e node.node_name
<edit>
ipmi:
  '1':
    bootproto: none
    cipher_suite: '17'
    enable: true
    gateway: 10.2.1.240
    hostname: ${_name}-ipmi
    hwaddr:
      node01: de:ad:be:ef:01:ab
    ipaddress: 10.2.1.${_index}
    ipmi_manageuser: true
    ipmi_password: S0m3P@55
    ipmi_userid: '5'
    ipmi_username: clustervisor
    netmask: 255.255.255.0
    switch_port: {}
    vlan: '2'
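The standard ipmitool utility (not part of ClusterVisor) can confirm that a BMC is reachable with credentials like those in the example above. The address and user here are illustrative, and DRY_RUN=echo prints the command instead of contacting a BMC.

```shell
# Sketch: query a BMC's power state over the IPMI LAN interface.
# Address and username are examples; in real use supply the password
# via -P, -f (password file), or -E (IPMI_PASSWORD environment variable)
# rather than hard-coding it. Set DRY_RUN= to run for real.
DRY_RUN=echo
$DRY_RUN ipmitool -I lanplus -H 10.2.1.1 -U clustervisor chassis power status
```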

PDU

This requires a managed PDU (not just metered, but managed!) and that it has been configured.

Web GUI

Both the node and the PDU must be configured.

Node

Nodes and Devices → Nodes → Select the node to be edited → Edit → General → Power Supplies → Each PSU must be listed.

[Screenshot: PowerSupplies.png]

PDU

Rack Layout → Edit Layout / Connections → Edit Mode → Connections → Click on the PDU to be edited → Select the node PSU on the correct outlet.

[Screenshot: ConnectionsPSU.png]

CLI

Both the node and the PDU must be configured.

Node
CODE
$ cv-conf -e node.node_name
<edit>
powersupplies:
- description: Redundant 1200W 80+ Titanium PSU
  name: psu[01-02]

PDU

Export, edit, then import.

Individual node
CODE
$ cv-connections --dump node.node01 > node01.out
$ vim node01.out
[powersupplies]
psu01 = "pdu.pdu03:A1"
psu02 = "pdu.pdu04:A1"

$ cv-connections -i node.node01 -I node01.out

Entire PDU
CODE
$ cv-connections --dump pdu.pdu01 > pdu01.out
$ vim pdu01.out
<edit>
[outlets]
F42 = "node.node01:psu01"
$ cv-connections -i pdu.pdu01 -I pdu01.out
