Introduction

This guide walks through setting up a secured (lockdown) Azure CycleCloud environment integrated with a Slurm cluster.

Architecture Overview

The setup involves a fully integrated environment where:

  • CycleCloud is deployed in snet-cc.
  • Slurm nodes are deployed in snet-worker, which is isolated from the internet.

This setup uses:

  • Azure CycleCloud version 8.7
  • Azure CycleCloud-Slurm version 3.0.5
  • Slurm version 23.02.06-1

Prerequisites

Virtual Network and Subnets

(1) Virtual Network

  • Name: vnet-cc-slurm
  • Address space: 10.0.0.0/16

(2) Subnets & NSGs

Name          Address Prefix   NSG
snet-cc       10.0.0.0/24      nsg-snet-cc
snet-slurm    10.0.1.0/24      nsg-snet-slurm
snet-pe       10.0.3.0/24      nsg-snet-pe
snet-worker   10.0.4.0/24      nsg-snet-worker

  • Configure snet-worker with outbound internet access denied for lockdown simulation.

Azure CLI Setup

Create the resource group, NSGs, virtual network, and subnets with the following Azure CLI commands.

rg=rg-jpe-cc-slurm-0717
loc=japaneast
vnet=vnet-jpe-cc-slurm-0717

az group create -l $loc -n $rg
az network nsg create -g $rg -l $loc -n nsg-snet-cc
az network nsg create -g $rg -l $loc -n nsg-snet-slurm
az network nsg create -g $rg -l $loc -n nsg-snet-anf
az network nsg create -g $rg -l $loc -n nsg-snet-pe
az network nsg create -g $rg -l $loc -n nsg-snet-worker
az network nsg create -g $rg -l $loc -n nsg-snet-mysql

az network vnet create \
--resource-group $rg \
--name $vnet \
--address-prefixes 10.0.0.0/16 \
--subnet-name snet-cc \
--subnet-prefixes 10.0.0.0/24 \
--network-security-group nsg-snet-cc

az network vnet subnet create \
--resource-group $rg \
--vnet-name $vnet \
--name snet-slurm \
--address-prefixes 10.0.1.0/24 \
--network-security-group nsg-snet-slurm

az network vnet subnet create \
--resource-group $rg \
--vnet-name $vnet \
--name snet-anf \
--address-prefixes 10.0.2.0/24 \
--network-security-group nsg-snet-anf \
--delegations Microsoft.NetApp/volumes

az network vnet subnet create \
--resource-group $rg \
--vnet-name $vnet \
--name snet-pe \
--address-prefixes 10.0.3.0/24 \
--network-security-group nsg-snet-pe

az network vnet subnet create \
--resource-group $rg \
--vnet-name $vnet \
--name snet-worker \
--address-prefixes 10.0.4.0/24 \
--network-security-group nsg-snet-worker

az network vnet subnet create \
--resource-group $rg \
--vnet-name $vnet \
--name snet-mysql \
--address-prefixes 10.0.5.0/24 \
--network-security-group nsg-snet-mysql \
--delegations Microsoft.DBforMySQL/flexibleServers
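
The requirement to deny outbound internet access on snet-worker is not covered by the commands above. One way to express it is a deny rule on nsg-snet-worker that targets the Internet service tag. The following is a minimal sketch: the function name, rule name, and priority are illustrative assumptions, not values from the original setup, and the function only runs when called.

```shell
#!/usr/bin/env bash
# Sketch: block outbound internet traffic for every NIC behind the given NSG.
# Rule name "DenyInternetOutbound" and priority 4000 are assumptions.
deny_internet_outbound() {
  local rg="$1" nsg="$2"
  az network nsg rule create \
    --resource-group "$rg" \
    --nsg-name "$nsg" \
    --name DenyInternetOutbound \
    --priority 4000 \
    --direction Outbound \
    --access Deny \
    --protocol '*' \
    --source-address-prefixes '*' \
    --source-port-ranges '*' \
    --destination-address-prefixes Internet \
    --destination-port-ranges '*'
}

# Example:
#   deny_internet_outbound rg-jpe-cc-slurm-0717 nsg-snet-worker
```

Traffic inside the virtual network is still permitted by the default AllowVnetOutBound rule, so nodes in snet-worker can keep talking to the CycleCloud server and to private endpoints in snet-pe.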

Deploying the CycleCloud Cluster

(1) Deploy CycleCloud by following the official documentation. Access the web UI in a browser at the CycleCloud VM's address, and ensure the NSG allows inbound HTTP/HTTPS from your IP.
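
The inbound rule for the web UI could be added from the CLI along these lines. This is a sketch only: the function name, rule name, and priority are assumptions, and the source address is whatever public IP you browse from.

```shell
#!/usr/bin/env bash
# Sketch: allow HTTP/HTTPS to the CycleCloud subnet from a single admin IP.
# Rule name "AllowWebFromAdmin" and priority 1000 are assumptions.
allow_web_inbound() {
  local rg="$1" nsg="$2" src_ip="$3"
  az network nsg rule create \
    --resource-group "$rg" \
    --nsg-name "$nsg" \
    --name AllowWebFromAdmin \
    --priority 1000 \
    --direction Inbound \
    --access Allow \
    --protocol Tcp \
    --source-address-prefixes "$src_ip" \
    --destination-port-ranges 80 443
}

# Example (203.0.113.10 is a documentation address, replace with your own):
#   allow_web_inbound rg-jpe-cc-slurm-0717 nsg-snet-cc 203.0.113.10
```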

(2) Create a private storage account for the locker with private endpoint enabled.

Storage account >> Security + networking >> Networking >> "Private endpoint connections" tab
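
As an alternative to the portal, the storage account and its private endpoint can be sketched with the Azure CLI. The function name, SKU, and the "pe-" and "pec-" name prefixes below are assumptions for illustration. The sketch places the endpoint in snet-pe and does not create the private DNS zone that name resolution of the blob endpoint normally requires.

```shell
#!/usr/bin/env bash
# Sketch: storage account with public access disabled, plus a blob private endpoint.
# SKU and the pe-/pec- naming are assumptions, not taken from the original setup.
create_private_locker_storage() {
  local rg="$1" loc="$2" vnet="$3" sa="$4"

  az storage account create \
    --resource-group "$rg" --name "$sa" --location "$loc" \
    --sku Standard_LRS --kind StorageV2 \
    --public-network-access Disabled \
    --allow-blob-public-access false

  # Resource ID of the account, needed to bind the private endpoint to it.
  local sa_id
  sa_id="$(az storage account show -g "$rg" -n "$sa" --query id -o tsv)"

  az network private-endpoint create \
    --resource-group "$rg" --name "pe-$sa" --location "$loc" \
    --vnet-name "$vnet" --subnet snet-pe \
    --private-connection-resource-id "$sa_id" \
    --group-id blob \
    --connection-name "pec-$sa"
}

# Example (the account name is hypothetical):
#   create_private_locker_storage rg-jpe-cc-slurm-0717 japaneast vnet-jpe-cc-slurm-0717 <storage-account-name>
```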

(3) Create a User Assigned Managed Identity.
Grant it the Storage Blob Data Reader role on the storage account.

(4) Enable System Assigned Managed Identity on CycleCloud VM. Assign Contributor role to the CycleCloud VM’s System Assigned Managed Identity on the Subscription.

Security >> Identity >> "System assigned" tab >> "Azure role assignments" button
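
Steps (3) and (4) can also be scripted. The following sketch assumes you already know the storage account's resource ID, the CycleCloud VM name, and the subscription ID. The function and argument names are illustrative, not part of the original setup.

```shell
#!/usr/bin/env bash
# Sketch: create both identities and their role assignments from the CLI.
setup_identities() {
  local rg="$1" uami="$2" sa_id="$3" cc_vm="$4" sub_id="$5"

  # User-assigned identity for the cluster nodes, with read access to the locker.
  az identity create --resource-group "$rg" --name "$uami"
  local uami_pid
  uami_pid="$(az identity show -g "$rg" -n "$uami" --query principalId -o tsv)"
  az role assignment create \
    --assignee-object-id "$uami_pid" \
    --assignee-principal-type ServicePrincipal \
    --role "Storage Blob Data Reader" \
    --scope "$sa_id"

  # System-assigned identity on the CycleCloud VM, Contributor on the subscription.
  az vm identity assign --resource-group "$rg" --name "$cc_vm"
  local vm_pid
  vm_pid="$(az vm show -g "$rg" -n "$cc_vm" --query identity.principalId -o tsv)"
  az role assignment create \
    --assignee-object-id "$vm_pid" \
    --assignee-principal-type ServicePrincipal \
    --role Contributor \
    --scope "/subscriptions/$sub_id"
}
```

Passing the principal ID with an explicit principal type avoids a directory lookup, which can fail briefly right after an identity is created.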


(5) Add your subscription to CycleCloud via Web UI.

Building the Slurm Cluster Template

Configure CycleCloud Project on CycleCloud Server

(1) SSH to the CycleCloud server VM and run the following commands

cyclecloud initialize

(2) Fetch and Build Slurm Project

cyclecloud project fetch https://github.com/Azure/cyclecloud-slurm/releases/3.0.5 slurm305
cd slurm305
cyclecloud project build

(3) Modify Install Script for Lockdown Environment
Edit rhel.sh under blobs/azure-slurm-install/ to include offline installation logic and required RPM repo (e.g., codeready-builder for perl-Switch).

cd blobs
tar zxvf azure-slurm-install-pkg-3.0.5.tar.gz
sudo vi azure-slurm-install/rhel.sh

Modify the following lines in the script:

if [ "$OS_VERSION" -gt "7" ]; then
    if ! rpm -q perl-Switch > /dev/null 2>&1; then
        dnf -y --enablerepo=codeready-builder-for-rhel-8-x86_64-rhui-rpms install perl-Switch
    fi
    PACKAGE_DIR=slurm-pkgs-rhel8

(4) Repackage the install package

tar -czvf azure-slurm-install-pkg-3.0.5.tar.gz azure-slurm-install/
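
Before copying the archive to the staging area, it can be worth checking that the repackaged tarball really contains the edited script. This is a small sketch assuming GNU tar. The function name is hypothetical, and the check simply looks for the "codeready-builder" string from the edit above.

```shell
#!/usr/bin/env bash
# Sketch: confirm the repackaged archive holds the modified rhel.sh.
verify_install_pkg() {
  local pkg="$1"
  # The archive must list azure-slurm-install/rhel.sh as a member.
  tar -tzf "$pkg" | grep -q '^azure-slurm-install/rhel\.sh$' || return 1
  # Stream that member to stdout and look for the codeready-builder edit.
  tar -xzOf "$pkg" azure-slurm-install/rhel.sh | grep -q 'codeready-builder'
}

# Example (run inside blobs/):
#   verify_install_pkg azure-slurm-install-pkg-3.0.5.tar.gz && echo "package looks good"
```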

(5) Copy to CycleCloud Staging Area

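# The paths below are relative to the project root, not blobs/
cd ~/slurm305
mkdir -p /opt/cycle_server/work/staging/projects/slurm/3.0.5 /opt/cycle_server/work/staging/projects/slurm/blobs
cp -r build/* /opt/cycle_server/work/staging/projects/slurm/3.0.5
cp blobs/*.gz /opt/cycle_server/work/staging/projects/slurm/blobs
chown -R cycle_server:cycle_server /opt/cycle_server/work/staging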

(6) Update Template & Upload
Edit templates/slurm.txt to reference the correct cluster-init blocks as below:

[azureuser@cycle-twn slurm305]$ grep cluster-init templates/slurm.txt
# May be used to identify the ID in cluster-init scripts
[[[cluster-init slurm:default:3.0.5]]]
[[[cluster-init slurm:scheduler:3.0.5]]]
[[[cluster-init slurm:login:3.0.5]]]
[[[cluster-init slurm:execute:3.0.5]]]
Description = "Specify the scheduling software, and base OS installed on all nodes, and optionally the cluster-init and chef versions from your locker."

(7) List the locker:

cyclecloud locker list

(8) Import the Slurm Template into CycleCloud

cyclecloud import_template slurm-305 -c slurm -f ~/slurm305/templates/slurm.txt
cyclecloud project upload sandbox-storage

Deploying the Slurm Cluster

(1) Create a new Slurm cluster using the slurm-305 template.

(2) Configure the cluster:

  • Choose snet-worker for networking (lockdown simulation).

  • Use the custom User Assigned Managed Identity.

  • Disable Return Proxy & Public Head Node under advanced networking.

Preparing the Custom VM Image for Slurm

Create a new RHEL 8.10 VM with internet access to serve as the base image.

(1) Add the cyclecloud repo definition manually (it is left disabled with enabled=0)

sudo -i
cat <<EOF | sudo tee /etc/yum.repos.d/cyclecloud.repo
[cyclecloud]
enabled=0
name=cyclecloud
baseurl=https://packages.microsoft.com/yumrepos/cyclecloud
gpgcheck=1
gpgkey=https://packages.microsoft.com/keys/microsoft.asc
EOF

(2) Download all necessary RPMs and install the required packages

yum update
mkdir -p /usr/slurm-rpms
cd /usr/slurm-rpms

wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm

dnf download --resolve munge munge-devel munge-libs hwloc numactl perl perl-JSON perl-Data-Dumper hostname bash-completion python3 python3-pip gcc make unzip tar wget vim curl mariadb-server mariadb which man-db libyaml libibverbs libibmad libibumad nfs-utils createrepo hicolor-icon-theme graphite2 jbigkit-libs cups-libs libXcursor libXinerama libXdamage perl-Switch libthai atk libXcomposite http-parser libXi libXft libXrandr gtk2 pango mysql-common mysql-libs

dnf install /usr/slurm-rpms/*.rpm --nogpgcheck

(3) Configure as local yum repo with createrepo

createrepo .
cat <<EOF | sudo tee /etc/yum.repos.d/local-slurm.repo
[local-slurm]
name=Local Slurm RPMs
baseurl=file:///usr/slurm-rpms
enabled=1
gpgcheck=0
EOF

dnf clean all
dnf makecache
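
To confirm that the local repository can satisfy installs once the image has no internet access, you can query it with every other repository disabled. The sketch below is one way to do that. The function name is hypothetical, and the package list in the example is just a sample from the download step.

```shell
#!/usr/bin/env bash
# Sketch: check that each named package resolves from local-slurm alone.
check_local_repo() {
  local pkg
  for pkg in "$@"; do
    # repoquery prints matching packages; no output means the package is absent.
    if ! dnf --disablerepo='*' --enablerepo=local-slurm \
         repoquery --quiet "$pkg" | grep -q .; then
      echo "missing from local-slurm: $pkg" >&2
      return 1
    fi
  done
}

# Example:
#   check_local_repo munge perl-Switch mariadb-server nfs-utils
```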

(4) Offline Python Environment
Download and extract a prebuilt Python virtual environment to /opt/azurehpc/slurm/.

The offline Python environment can be created on a Slurm scheduler VM that has internet access and then copied over. Alternatively, you can download the tar file from here: http://tiny.cc/enlp001

tar -xzf slurm_venv.tar.gz -C /opt/azurehpc/slurm
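
If you build the environment yourself on an internet-connected scheduler VM, packaging it might look like the sketch below. It assumes the virtual environment lives in a directory named venv directly under /opt/azurehpc/slurm, which is an assumption about the layout rather than something stated in this guide. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: archive a virtual environment so it extracts under the same parent directory.
pack_venv() {
  local root="$1" name="$2" out="$3"
  if [ ! -d "$root/$name" ]; then
    echo "no such directory: $root/$name" >&2
    return 1
  fi
  # -C keeps member paths relative to $root, matching "tar -xzf ... -C $root" on the target.
  tar -czf "$out" -C "$root" "$name"
}

# Example ("venv" is an assumed directory name):
#   pack_venv /opt/azurehpc/slurm venv slurm_venv.tar.gz
```

Python virtual environments embed absolute paths in their scripts, so the archive should be extracted to the same location it was created from.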

(5) Configure Firewall (for NFS)

firewall-cmd --permanent --add-service=nfs
firewall-cmd --permanent --add-service=rpc-bind
firewall-cmd --permanent --add-service=mountd
firewall-cmd --reload

(6) Prepare VM for Imaging

waagent --deprovision+user --force

(7) Capture image and obtain image ID.


(8) Update Slurm cluster to use the custom image ID.
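
Step (7) can be done in the portal or from the CLI. The following is a sketch of the CLI route for a managed image, run from a machine with the Azure CLI after the deprovision step. The function name and arguments are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: generalize the deprovisioned VM, capture it, and print the image ID.
capture_image() {
  local rg="$1" vm="$2" image="$3"
  az vm deallocate --resource-group "$rg" --name "$vm" --output none
  az vm generalize --resource-group "$rg" --name "$vm" --output none
  az image create --resource-group "$rg" --name "$image" --source "$vm" --output none
  # Only the resource ID is written to stdout, so it can be captured in a variable.
  az image show --resource-group "$rg" --name "$image" --query id --output tsv
}

# Example (VM and image names are hypothetical):
#   image_id="$(capture_image rg-jpe-cc-slurm-0717 <base-vm-name> <image-name>)"
```

If the base VM is Generation 2, `az image create` also needs `--hyper-v-generation V2`. The printed ID is the value to use as the custom image in the cluster settings.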

Verifying CycleCloud and Slurm

(1) Start the cluster and verify the scheduler status.


(2) SSH from CycleCloud server to scheduler and submit test job:

sudo ssh-keygen -R 10.0.4.4
sudo ssh -i /opt/cycle_server/.ssh/cyclecloud.pem cyclecloud@10.0.4.4

sinfo
sbatch -p hpc --wrap="hostname; sleep 300"

(3) Confirm compute nodes are ready.
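
Because compute nodes are provisioned on demand, the test job may stay pending for several minutes. A small polling helper, run on the scheduler, can make the check less manual. This is a sketch: the function name, the 15-second interval, and the default 900-second timeout are assumptions.

```shell
#!/usr/bin/env bash
# Sketch: poll squeue until the job starts running or leaves the queue.
wait_for_job() {
  local jobid="$1" timeout="${2:-900}" waited=0 state=""
  while [ "$waited" -lt "$timeout" ]; do
    # -h drops the header, %T prints the job state (PENDING, CONFIGURING, RUNNING, ...).
    state="$(squeue -h -j "$jobid" -o %T 2>/dev/null)"
    case "$state" in
      RUNNING) echo "job $jobid is running"; return 0 ;;
      "")      echo "job $jobid is no longer in the queue"; return 0 ;;
    esac
    sleep 15
    waited=$((waited + 15))
  done
  echo "timed out waiting for job $jobid (last state: $state)" >&2
  return 1
}

# Example:
#   jobid="$(sbatch --parsable -p hpc --wrap='hostname; sleep 300')"
#   wait_for_job "$jobid" && sinfo -N -l
```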