
Power on/off cluster

This document is a template guide: each HPC system can have variations that change the order of the steps below.

Power Off

Check status

Verify the status of the compute nodes and note which nodes are up/down.

CODE
$ cv-power -g nodes status 

-g nodes refers to a node group containing all compute nodes. If you do not have this group, are unsure whether you do, or want to see a listing of available groups, run cv-nodenames -a --dump-groups for a full listing of all nodes and devices and their associated groups.
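If the status listing is long, a small text filter can help note which nodes are down. This is a hypothetical sketch: the "name state" line format below is an assumption, not the documented cv-power output, so adjust the awk pattern to match what your system actually prints.

```shell
# Hypothetical helper: list nodes whose state is not "on" from status
# output. The "name state" line format is an assumption; adapt it to
# the real cv-power output on your system.
list_down_nodes() { awk '$2 != "on" { print $1 }'; }

# Example with mocked-up status lines:
printf 'node01 on\nnode02 off\nnode03 on\n' | list_down_nodes
```

With the mocked-up input above, this prints node02.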

Turn off compute nodes

Power off all compute nodes. This example uses the "group" directive. To see a list of groups and nodes, run cv-nodenames -a --dump-groups.

CODE
cv-power -g nodes shutdown

Turn off non-compute, non-storage nodes

Typically, anything that is neither a compute node nor a storage node will be a login node or, in some cases, the head node (when the head node has no NFS storage mounts).

NOTE: Do not shut down the node you are currently logged into. If there is a single login node from which all other commands are run, it should be the last of these nodes to be shut down.

NOTE: Any system that uses dual mirrored drives should be properly shut down with the poweroff command. Other methods (e.g., cv-power or init 0) may not properly unmount the MDADM RAID, which can cause boot issues.

CODE
cv-power -n login01 shutdown
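Where the dual-mirrored-drive note applies, one option is to run poweroff on each login node over SSH instead. A minimal sketch, assuming working root SSH keys and example node names; DRY_RUN=echo prints the commands instead of running them.

```shell
# Sketch: run 'poweroff' over SSH on each login node so the MDADM RAID
# is unmounted cleanly. Node names are examples. DRY_RUN=echo prints
# the commands; set DRY_RUN= (empty) to actually power the nodes off.
DRY_RUN=echo
for node in login01 login02; do
    $DRY_RUN ssh "root@$node" poweroff
done
```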

Turn off storage nodes

Lastly, turn off the storage nodes. In some clusters this is the head node; in others it is storage01, or there are multiple storage nodes. This example uses the "group" directive. To see a list of groups and nodes, run cv-nodenames -a --dump-groups.

NOTE: Any system that uses dual mirrored drives should be properly shut down with the poweroff command. Other methods (e.g., cv-power or init 0) may not properly unmount the MDADM RAID, which can cause boot issues.

CODE
cv-power -g storage shutdown

Turn off the appliance

This must be done by logging into the appliance and running:

CODE
cv-power -A -n cv01 shutdown

or by manually pressing the power button.

Power on

To power everything back on, reverse the order:

Start the appliance

Push the button on the appliance - wait until it has booted.

Start the storage

This may be dedicated storage nodes, or the head node.

Push the button on the storage node(s) or run:

CODE
cv-power -g storage on

Wait until it has booted, then log in to the storage node(s) and check that everything is healthy and there are no obvious errors. Verify that the storage mounts are working:

  • df -h

  • For ZFS: zpool status

  • For BeeGFS: beegfs-check-servers
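The checks above can be combined into one small script. This is a sketch, not a ClusterVisor tool: the ZFS and BeeGFS commands are only run when installed, so it is safe on any node.

```shell
# Post-boot storage sanity checks (a sketch combining the steps above).
storage_checks() {
    df -h                                   # are the expected mounts there?
    if command -v zpool >/dev/null 2>&1; then
        zpool status -x                     # ZFS: "all pools are healthy" when OK
    fi
    if command -v beegfs-check-servers >/dev/null 2>&1; then
        beegfs-check-servers                # BeeGFS: are all servers reachable?
    fi
    echo "storage checks finished"
}
storage_checks
```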

Start the head and/or login nodes

Push the power button on the head node and wait until it has booted. Log in and check that everything is healthy and there are no obvious errors. Then do the same on the login node(s).

Start the nodes

CODE
cv-power -g nodes on 

That should power on all the compute nodes. If any have issues, you may need to press their power buttons manually.
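One way to confirm nodes are actually back is to ping each one. A sketch with example node names; DRY_RUN=echo prints the checks instead of running them.

```shell
# Sketch: confirm each compute node responds after power-on. Node names
# are examples; set DRY_RUN= (empty) to really run the pings.
DRY_RUN=echo
for node in node01 node02; do
    $DRY_RUN ping -c 1 -W 2 "$node"
done
```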

How to configure power on/off options

Each node can be configured to use one of three power methods: SSH, IPMI, or PDU.

Default

Ensure that the default method is selected.

Web GUI

Nodes and Devices → Nodes → Select the node to be edited → Edit → General → Default Power On/Off Method

[Screenshot: DefaultPower.png]

CLI

CODE
$ cv-conf -e node.node_name
power_method: ipmi

SSH

This method requires the node to be online and the Linux operating system to be working. SSH keys must be working.
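A quick way to confirm key-based root SSH before relying on this method: BatchMode makes ssh fail instead of prompting for a password. The node name is an example, and DRY_RUN=echo just prints the command.

```shell
# Sketch: test passwordless root SSH to a node (name is an example).
# With working keys the real command exits 0; without them it fails
# instead of prompting, thanks to BatchMode. Set DRY_RUN= to run it.
DRY_RUN=echo
$DRY_RUN ssh -o BatchMode=yes root@node01 true
```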

Web GUI

Users, Groups, and Roles → root User Trusted SSH Keys: Ensure the appliance, root, and any administrators have their keys listed.

[Screenshot: SSHKeys.png]

CLI

CODE
$ cv-conf -e config.ssh
<edit>
trusted_root_ssh_keys:
- label: Trusted Admin user name here
  public_key: ssh-ed25519 AAAAC3N...long_string_here...ASD user@host.cluster

IPMI

This method requires a working IPMI/BMC network that has been properly configured.

Web GUI

Nodes and Devices → Nodes → Select the node to be edited → Edit → Networking → IPMI devices

[Screenshot: IPMI.png]

CLI

CODE
$ cv-conf -e node.node_name
<edit>
ipmi:
  '1':
    bootproto: none
    cipher_suite: '17'
    enable: true
    gateway: 10.2.1.240
    hostname: ${_name}-ipmi
    hwaddr:
      node01: de:ad:be:ef:01:ab
    ipaddress: 10.2.1.${_index}
    ipmi_manageuser: true
    ipmi_password: S0m3P@55
    ipmi_userid: '5'
    ipmi_username: clustervisor
    netmask: 255.255.255.0
    switch_port: {}
    vlan: '2'
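The standard ipmitool utility (not part of ClusterVisor) can confirm that a BMC is reachable with credentials like those in the example above. The address and user here are illustrative, and DRY_RUN=echo prints the command instead of contacting a BMC.

```shell
# Sketch: query a BMC's power state over the IPMI LAN interface.
# Address and username are examples; in real use supply the password
# via -P, -f (password file), or -E (IPMI_PASSWORD environment variable)
# rather than hard-coding it. Set DRY_RUN= to run for real.
DRY_RUN=echo
$DRY_RUN ipmitool -I lanplus -H 10.2.1.1 -U clustervisor chassis power status
```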

PDU

This requires a managed PDU (not just metered, but managed!) and that it has been configured.

Web GUI

Both the node and the PDU must be configured.

Node

Nodes and Devices → Nodes → Select the node to be edited → Edit → General → Power Supplies → Each PSU must be listed.

[Screenshot: PowerSupplies.png]

PDU

Rack Layout → Edit Layout / Connections → Edit Mode → Connections → Click on the PDU to be edited → Select the node PSU on the correct outlet.

[Screenshot: ConnectionsPSU.png]

CLI

Both the node and the PDU must be configured.

Node
CODE
$ cv-conf -e node.node_name
<edit>
powersupplies:
- description: Redundant 1200W 80+ Titanium PSU
  name: psu[01-02]

PDU

Export, edit, then import.

Individual node
CODE
$ cv-connections --dump node.node01 > node01.out
$ vim node01.out
[powersupplies]
psu01 = "pdu.pdu03:A1"
psu02 = "pdu.pdu04:A1"

$ cv-connections -i node.node01 -I node01.out

Entire PDU
CODE
$ cv-connections --dump pdu.pdu01 > pdu01.out
$ vim pdu01.out
<edit>
[outlets]
F42 = "node.node01:psu01"
$ cv-connections -i pdu.pdu01 -I pdu01.out
