Why an Appliance?
When building out your cluster, you may ask, “Why do I need a whole appliance just to manage my cluster?”
This is a valid question, and one ACT did not take lightly. After 20 years of experience in HPC, we found that managing a cluster from a separate appliance was the best approach, providing resiliency and ease of management with a layer of security in mind.
Resiliency
A common approach for cluster management software is to install and run it on the head node or a login node. While this was ACT’s procedure in the past, hardware failures, overactive users, and the growing resource demands of both managing and operating a cluster sometimes left cluster management inaccessible. A head node already carries a lot of responsibility for running key components of a cluster (how much varies between institutions), and adding monitoring, alerting, image management, configuration management, and similar services means the physical resources required for the head node grow over time, placing a lot of weight on a single point of failure. Should the head node go down due to a hardware failure, or become unresponsive because one or more users overloaded its resources, you lose not only the ability to use the cluster but also your insight into it. If your monitoring server goes down, how do you get alerted?
While some institutions can rely on alerting from their internal IT teams, this is not available to all administrators. Shifting monitoring and management functions to an appliance removes a large set of operations from the head node, freeing it to perform its primary operational functions. You retain insight and cluster health monitoring even if the head node becomes inaccessible, and you gain dedicated space for node images, configuration management, and active and historical statistics. Additionally, each appliance ships with two NVMe drives in a RAID 1 mirror for resiliency should one drive fail.
Shifting these responsibilities to a separate appliance reduces your points of failure and the weight of any one failure.
Ease of Management
As mentioned when discussing resiliency, the head node of an HPC cluster already manages many important functions. By moving cluster management and other non-compute, yet important, services to an appliance, you relieve the head node of those workloads and gain a single location for maintaining everything related to the cluster’s configuration. Examples of such services and resources include, but are not limited to: node images; node information (i.e., serial numbers, IP addresses, MAC addresses, etc.); DHCP; DNS configuration; NTP; serial console access; node, scheduler, and service monitoring across the cluster; and alerting. Keeping all of these on an appliance gives you an outside perspective, separation from your cluster, and a better vantage point for managing the cluster as a whole.
This also makes the loss or inaccessibility of the head node less catastrophic: in the worst case, another node can be promoted to head node from a stored image in a short amount of time.
Security
The appliance runs as just that: an appliance. If you access its command line, you are placed in a limited console with only the ability to run ClusterVisor-related commands. Within ClusterVisor, administrators can also grant or limit user access as much as desired.
ClusterVisor also has a web UI that provides access to configuration management and monitoring. Administrators can create roles that give additional administrators full or limited access, disallow other users entirely, allow users to view only monitoring and statistics, or set any other level of access desired.
The appliance provides a natural separation from normal head node or login node access to protect from inadvertent or unauthorized access and alterations of cluster configurations and management.
Making the management of HPC clusters as seamless, easy, flexible, and reliable as possible is the primary focus of ClusterVisor. ACT best pursues this by using an appliance that sits outside of normal cluster operations, freeing computing resources from management functions so they can devote their power to HPC workloads.