Build a CycleCloud Slurm Cluster with MySQL Integration and Scheduler HA

Introduction

This guide walks through setting up a CycleCloud environment that integrates with a Slurm Cluster. It also covers integrating MySQL and an external NFS server, which together allow for high availability of the cluster schedulers.

Architecture

The following architecture shows a CycleCloud environment with a Slurm Cluster, MySQL integration, and scheduler HA.
All the resources are deployed in the same resource group and the same virtual network. The CycleCloud Cluster is deployed in the subnet snet-cc, the Slurm Cluster in the subnet snet-slurm, the MySQL Server in the subnet snet-mysql, and the NFS Server in the subnet snet-anf.

The Slurm Cluster source for CycleCloud will be downloaded from the Azure-CycleCloud-Slurm GitHub repository.

Prerequisites

  1. Virtual network:

#  Virtual Network Name  Address Space
1  vnet-cc-slurm         10.0.0.0/16

  2. Subnets:

#  Name         Address       NSG              Delegation
1  snet-cc      10.0.0.0/24   nsg-snet-cc
2  snet-slurm   10.0.1.0/24   nsg-snet-slurm
3  snet-anf     10.0.2.0/24   nsg-snet-anf     Microsoft.NetApp/volumes
4  snet-pe      10.0.3.0/24   nsg-snet-pe
5  snet-worker  10.0.4.0/24   nsg-snet-worker
6  snet-mysql   10.0.5.0/24   nsg-snet-mysql   Microsoft.DBforMySQL/flexibleServers

  • Azure CLI commands to create the resource group, the NSGs, and the virtual network with its subnets:
    az group create -l japaneast -n rg-jpe-cc-slurm-20230715

    az network nsg create -g rg-jpe-cc-slurm-20230715 -l japaneast -n nsg-snet-cc
    az network nsg create -g rg-jpe-cc-slurm-20230715 -l japaneast -n nsg-snet-slurm
    az network nsg create -g rg-jpe-cc-slurm-20230715 -l japaneast -n nsg-snet-anf
    az network nsg create -g rg-jpe-cc-slurm-20230715 -l japaneast -n nsg-snet-pe
    az network nsg create -g rg-jpe-cc-slurm-20230715 -l japaneast -n nsg-snet-worker
    az network nsg create -g rg-jpe-cc-slurm-20230715 -l japaneast -n nsg-snet-mysql

    az network vnet create \
    --resource-group rg-jpe-cc-slurm-20230715 \
    --name vnet-jpe-cc-slurm-20230715 \
    --address-prefixes 10.0.0.0/16 \
    --subnet-name snet-cc \
    --subnet-prefixes 10.0.0.0/24 \
    --network-security-group nsg-snet-cc

    az network vnet subnet create \
    --resource-group rg-jpe-cc-slurm-20230715 \
    --vnet-name vnet-jpe-cc-slurm-20230715 \
    --name snet-slurm \
    --address-prefixes 10.0.1.0/24 \
    --network-security-group nsg-snet-slurm

    az network vnet subnet create \
    --resource-group rg-jpe-cc-slurm-20230715 \
    --vnet-name vnet-jpe-cc-slurm-20230715 \
    --name snet-anf \
    --address-prefixes 10.0.2.0/24 \
    --network-security-group nsg-snet-anf \
    --delegations Microsoft.NetApp/volumes

    az network vnet subnet create \
    --resource-group rg-jpe-cc-slurm-20230715 \
    --vnet-name vnet-jpe-cc-slurm-20230715 \
    --name snet-pe \
    --address-prefixes 10.0.3.0/24 \
    --network-security-group nsg-snet-pe

    az network vnet subnet create \
    --resource-group rg-jpe-cc-slurm-20230715 \
    --vnet-name vnet-jpe-cc-slurm-20230715 \
    --name snet-worker \
    --address-prefixes 10.0.4.0/24 \
    --network-security-group nsg-snet-worker

    az network vnet subnet create \
    --resource-group rg-jpe-cc-slurm-20230715 \
    --vnet-name vnet-jpe-cc-slurm-20230715 \
    --name snet-mysql \
    --address-prefixes 10.0.5.0/24 \
    --network-security-group nsg-snet-mysql \
    --delegations Microsoft.DBforMySQL/flexibleServers
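
Before running the commands above, it can help to confirm that every subnet prefix is aligned to its mask and that no two prefixes overlap. The following is a minimal sketch in plain shell arithmetic; the helper names ip_to_int, cidr_range, and cidrs_overlap are hypothetical and are not part of the Azure CLI.

```shell
# Hypothetical helpers (not part of the Azure CLI) for sanity-checking prefixes.

# Convert a dotted IPv4 address to a 32-bit integer.
ip_to_int() {
  _oifs=$IFS
  IFS=.
  set -- $1
  IFS=$_oifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Print "first last" (as integers) for a CIDR prefix.
# Fails if the base address is not aligned to the prefix length.
cidr_range() {
  _base=$(ip_to_int "${1%/*}")
  _size=$(( 1 << (32 - ${1#*/}) ))
  if [ $(( $_base % $_size )) -ne 0 ]; then
    echo "misaligned prefix: $1" >&2
    return 1
  fi
  echo "$_base $(( $_base + $_size - 1 ))"
}

# Return 0 if two prefixes overlap, 1 if they do not, 2 if either is invalid.
cidrs_overlap() {
  _r1=$(cidr_range "$1") || return 2
  _r2=$(cidr_range "$2") || return 2
  # Two ranges overlap when each one starts before the other ends.
  [ "${_r1% *}" -le "${_r2#* }" ] && [ "${_r2% *}" -le "${_r1#* }" ]
}

# Compare every pair of subnet prefixes from the prerequisites table once.
set -- 10.0.0.0/24 10.0.1.0/24 10.0.2.0/24 10.0.3.0/24 10.0.4.0/24 10.0.5.0/24
while [ $# -gt 1 ]; do
  a=$1
  shift
  for b in "$@"; do
    if cidrs_overlap "$a" "$b"; then
      echo "overlap: $a $b"
    fi
  done
done
```

The loop prints nothing when no pair overlaps. A prefix such as 10.0.5.0/23 is rejected by cidr_range because 10.0.5.0 is not a /23 boundary.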

Deploy CycleCloud Cluster

  • Refer to Create a CycleCloud Cluster to create a CycleCloud Cluster.
  • After the CycleCloud Cluster is created, you can access the CycleCloud Web UI with the URL: http://cyclecloud-cluster-IP

Note: Before accessing the CycleCloud Web UI, you need to add an inbound rule to the NSG of the CycleCloud Cluster subnet. The inbound rules (HTTP/HTTPS) should allow access from your IP address to the CycleCloud Cluster subnet.
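
The inbound rule from the note above could be created from the Azure CLI instead of the portal. This is a sketch only: the function name, the rule name AllowCycleCloudWeb, and the priority 100 are illustrative assumptions.

```shell
# Sketch: allow HTTP/HTTPS into the CycleCloud subnet from one client address.
# The rule name and priority are illustrative assumptions.
allow_cyclecloud_web() {
  # $1 = resource group, $2 = NSG name, $3 = client IP address or CIDR
  az network nsg rule create \
    --resource-group "$1" \
    --nsg-name "$2" \
    --name AllowCycleCloudWeb \
    --priority 100 \
    --direction Inbound \
    --access Allow \
    --protocol Tcp \
    --source-address-prefixes "$3" \
    --destination-port-ranges 80 443
}
```

For example: allow_cyclecloud_web rg-jpe-cc-slurm-20230715 nsg-snet-cc 203.0.113.10/32, where 203.0.113.10 is a documentation placeholder for your own public IP address.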

  • Create a Storage account for CycleCloud locker storage.
    (1) Create a storage account with a private endpoint enabled, to serve as private CycleCloud locker storage.

  • Add a subscription to CycleCloud Cluster

(1) Before adding a subscription, you need to enable the system assigned identity of CycleCloud Cluster VM.

(2) On VM settings, click Identity and click Azure role assignments to add a role assignment to the system assigned identity.

(3) Assign the subscription Contributor role to the system-assigned identity. The Contributor role grants more privileges than CycleCloud requires. If that is a security concern, assign a lower-privilege custom role instead. Refer to the CycleCloud documentation to create a custom role for CycleCloud.

(4) After adding the role assignment, add the subscription to the CycleCloud Cluster. On the CycleCloud Web UI, click Add Subscription, enter the storage account created in the previous step, and save it as a CycleCloud subscription.
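
Steps (1) to (3) can also be scripted. The sketch below assumes the Azure CLI is logged in with rights to assign roles; the function name is hypothetical and all argument values are placeholders.

```shell
# Sketch: enable the system-assigned identity on the CycleCloud VM and grant it
# the Contributor role on the subscription. Argument values are placeholders.
grant_cyclecloud_identity() {
  # $1 = resource group, $2 = VM name, $3 = subscription ID
  az vm identity assign --resource-group "$1" --name "$2" >/dev/null || return 1

  # Look up the principal ID of the identity that was just enabled.
  principal_id=$(az vm show --resource-group "$1" --name "$2" \
    --query identity.principalId --output tsv) || return 1

  az role assignment create \
    --assignee-object-id "$principal_id" \
    --assignee-principal-type ServicePrincipal \
    --role Contributor \
    --scope "/subscriptions/$3"
}
```

To use a lower-privilege custom role, replace Contributor with the name of that role.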

Prepare an NFS for Slurm cluster

Because we are building a highly available scheduler environment, we need an NFS server for the Slurm Cluster. It stores the Slurm configuration files and the Slurm state files, so that more than one scheduler node can read them. The NFS server can be a VM or a managed NFS service. In this case, we use Azure NetApp Files.

  • Create Azure NetApp Files

(1) Create an Azure NetApp Files account and a capacity pool. Refer to the Azure NetApp Files documentation listed under Reference.

(2) Create two volumes in the capacity pool ( sched and shared ).

A. Set the Network features to Standard
B. Set the Protocol types to NFSv4.1
C. Set the Unix permissions to 0775


  • Write down the NFS mount point of the two volumes. The NFS mount point is used to mount the volumes to the Slurm Cluster VMs.
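
To confirm that a volume is reachable before handing it to the cluster, it can be mounted by hand from a VM in the same virtual network. This is a sketch: the function name is hypothetical, and the mount options are the ones Azure NetApp Files typically shows in its mount instructions for NFSv4.1, so check them against the instructions displayed for your own volume.

```shell
# Sketch: mount an Azure NetApp Files NFSv4.1 export to confirm it is reachable.
# The IP address and export path come from the volume's mount instructions.
mount_anf_volume() {
  # $1 = volume IP address, $2 = export path (for example sched), $3 = local mount point
  sudo mkdir -p "$3" || return 1
  sudo mount -t nfs \
    -o rw,hard,rsize=262144,wsize=262144,sec=sys,vers=4.1,tcp \
    "$1:/$2" "$3"
}
```

For example: mount_anf_volume 10.0.2.4 sched /mnt/sched, where 10.0.2.4 is a placeholder for the IP address you wrote down.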


Create a MySQL Database Service

Create an Azure Database for MySQL Flexible Server. Refer to the MySQL quickstart listed under Reference.

In the networking settings, select Private access (VNet Integration) as the connectivity method and place the server in the delegated subnet snet-mysql.
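
A rough command-line equivalent of the portal steps is sketched below. The function name is hypothetical, the server name and credentials are placeholders, and the flag names should be checked against your installed Azure CLI version.

```shell
# Sketch: create a MySQL Flexible Server with private access (VNet integration).
# Server name, admin user, and password are placeholders supplied by the caller.
create_mysql_server() {
  # $1 = resource group, $2 = server name, $3 = admin user, $4 = admin password
  az mysql flexible-server create \
    --resource-group "$1" \
    --name "$2" \
    --location japaneast \
    --vnet vnet-jpe-cc-slurm-20230715 \
    --subnet snet-mysql \
    --admin-user "$3" \
    --admin-password "$4"
}
```

Passing a virtual network and a delegated subnet is what selects private access; no public endpoint is requested here.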


Prepare a VM image for Slurm Cluster

In this tutorial, we use OpenLogic CentOS 7.9 generation 2 from Marketplace as a base OS image. You can use your own VM image as a base OS image.

  • Create a VM from the base OS image. Log in to the VM and run the following commands to install the Slurm packages.
    sudo -i
    yum update
    wget -P /tmp https://raw.githubusercontent.com/themorey/cyclecloud-scripts/main/slurm-install.sh
    bash /tmp/slurm-install.sh


  • After installing the packages, run the following commands to confirm that the Slurm binaries and the slurm user exist.

    which sinfo
    id slurm
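
The two checks above can be wrapped in a small function that fails loudly, which is convenient if the image build is scripted. This is a sketch; the function name check_image is hypothetical.

```shell
# Sketch: fail if an expected user or command is missing from the image.
check_image() {
  # $1 = expected user, remaining arguments = expected commands
  user=$1
  shift
  status=0
  if ! id "$user" >/dev/null 2>&1; then
    echo "missing user: $user" >&2
    status=1
  fi
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing command: $cmd" >&2
      status=1
    fi
  done
  return "$status"
}
```

For example: check_image slurm sinfo returns a non-zero status and prints what is missing if the installation did not complete.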
  • Deprovision the VM. Remove +user from the command if you want to keep the user account.

    waagent -deprovision+user -force


  • The Scheduler VM may fail to register with Azure DNS. To mitigate this, perform the following step after deprovisioning with waagent and before creating the VM image.

Check the /etc/sysconfig/network-scripts/ifcfg-eth0 file and remove the line DHCP_HOSTNAME=localhost.localdomain.
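
The DHCP_HOSTNAME line can be removed with sed instead of an editor. A minimal sketch, with a hypothetical function name; it keeps a .bak copy of the original file next to it.

```shell
# Sketch: delete any DHCP_HOSTNAME line from an interface config file.
remove_dhcp_hostname() {
  # $1 = path to the ifcfg file; the original is saved as "$1.bak"
  sed -i.bak '/^DHCP_HOSTNAME=/d' "$1"
}
```

Run it as root, for example: remove_dhcp_hostname /etc/sysconfig/network-scripts/ifcfg-eth0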

  • Capture the VM as a VM image. Refer to Create a VM image for the steps.
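
If you prefer the Azure CLI to the portal for the capture, the sequence might look like the sketch below. The function name is hypothetical and the resource names are placeholders. The image is marked as generation 2 to match the base image used in this tutorial.

```shell
# Sketch: generalize a deprovisioned VM and capture it as a managed image.
capture_vm_image() {
  # $1 = resource group, $2 = source VM name, $3 = image name
  az vm deallocate --resource-group "$1" --name "$2" >/dev/null || return 1
  az vm generalize --resource-group "$1" --name "$2" >/dev/null || return 1
  az image create --resource-group "$1" --name "$3" --source "$2" \
    --hyper-v-generation V2 >/dev/null || return 1
  # Print the image resource ID, which the cluster settings ask for later.
  az image show --resource-group "$1" --name "$3" --query id --output tsv
}
```

The printed resource ID is the value to enter when the cluster is configured to use a custom image.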

Prepare a Slurm Cluster project template for CycleCloud

  • Log in to the CycleCloud Server and initialize the CycleCloud CLI.

    cyclecloud initialize


  • Login to CycleCloud Server and prepare Azure-Slurm 2.7.3 project

(1) Get the Azure-Slurm 2.7.3 project from GitHub.

mkdir slurm273
cyclecloud project fetch https://github.com/Azure/cyclecloud-slurm/releases/2.7.3 slurm273

(2) Build the project

cd slurm273
cyclecloud project build
sudo mkdir -p /opt/cycle_server/work/staging/projects/slurm/2.7.3
sudo mkdir -p /opt/cycle_server/work/staging/projects/slurm/blobs
sudo cp -r build/slurm/* /opt/cycle_server/work/staging/projects/slurm/2.7.3
sudo cp -r blobs/* /opt/cycle_server/work/staging/projects/slurm/blobs
sudo chown -R cycle_server:cycle_server /opt/cycle_server/work/staging

(3) Add the following lines to ~/slurm273/specs/default/chef/site-cookbooks/slurm/recipes/login.rb for the login server template.

link '/etc/slurm/keep_alive.conf' do
  to '/sched/keep_alive.conf'
  owner "#{slurmuser}"
  group "#{slurmuser}"
end

service 'munge' do
  action [:enable, :restart]
end

include_recipe 'slurm::accounting'

  • Preload the MySQL certificate

(1) On the CycleCloud Server, download the certificate to the ~/slurm273/specs/default/chef/site-cookbooks/slurm/files/default folder.

cd ~/slurm273/specs/default/chef/site-cookbooks/slurm/files/default
wget https://dl.cacerts.digicert.com/DigiCertGlobalRootCA.crt.pem

(2) Edit the ~/slurm273/specs/default/chef/site-cookbooks/slurm/recipes/accounting.rb file with the following changes.

A. Change remote_file to cookbook_file
B. Comment out the original source line
C. Add a new source line that points to DigiCertGlobalRootCA.crt.pem

cookbook_file '/etc/slurm/BaltimoreCyberTrustRoot.crt.pem' do
  #source node[:slurm][:accounting][:certificate_url]
  source 'DigiCertGlobalRootCA.crt.pem'
  owner 'slurm'
  group 'slurm'
  mode 0644
end

  • Edit the Slurm Cluster template

(1) Edit the ~/slurm273/templates/slurm.txt file and change the cluster-init lines so that they reference version 2.7.3.

[[[cluster-init slurm:default:2.7.3]]]
[[[cluster-init slurm:scheduler:2.7.3]]]
[[[cluster-init slurm:login:2.7.3]]]
[[[cluster-init slurm:execute:2.7.3]]]

(2) Run the following command to set slurm.install to false.

sed -i '/slurm_version$/a \\tslurm.install = false' ~/slurm273/templates/slurm.txt
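
Because the sed command appends a line every time it runs, running it twice leaves duplicate slurm.install entries. The sketch below is one way to check the result; the function name is hypothetical. It counts the lines that end in slurm_version and the slurm.install = false lines, and succeeds only when the two counts match.

```shell
# Sketch: compare the number of slurm_version lines with slurm.install lines.
check_install_flag() {
  # $1 = path to the cluster template
  versions=$(grep -c 'slurm_version$' "$1" || true)
  flags=$(grep -c 'slurm\.install = false' "$1" || true)
  echo "slurm_version lines: $versions, slurm.install lines: $flags"
  [ "$versions" -eq "$flags" ]
}
```

For example: check_install_flag ~/slurm273/templates/slurm.txt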
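  • Upload the project files to the CycleCloud locker storage. Run the upload from the project directory and pass the locker name reported by cyclecloud locker list (cc-slurm-20230715-storage in this example).

cd ~/slurm273
cyclecloud locker list
cyclecloud project upload cc-slurm-20230715-storage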

Create a Slurm Cluster with CycleCloud

Log in to the CycleCloud Server and run the following command to create the Slurm Cluster from the template.

cyclecloud import_cluster slurm-20230715 -c slurm -f ~/slurm273/templates/slurm.txt

Configure Slurm Cluster

  • Log in to the CycleCloud Web UI and edit the Slurm Cluster settings.

(1) Configure the Network Attached Storage settings to use the NFS volumes created in the previous step. The IP address and export path are those of your NFS volume mount point.

(2) In Advanced Settings, enable the Slurm HA Scheduler and MySQL HA Database options.

(3) Set the Scheduler OS, HPC OS, and HTC OS to a custom image and enter the resource ID of the VM image.

(4) Uncheck the Return Proxy and Public Head Node settings.

Now that all the components are ready, we can start using the Slurm Cluster.


Reference

  • GitHub - Azure CycleCloud Slurm
  • Quickstart - Install CycleCloud using the Marketplace image
  • Create a custom role and managed identity for CycleCloud
  • Create an NFS volume for Azure NetApp Files
  • Quickstart: Use the Azure portal to create an Azure Database for MySQL - Flexible Server
  • Remove machine specific information by deprovisioning or generalizing a VM before creating an image