# Slurm Monitoring with Prometheus and Grafana

Monitoring Slurm, a powerful workload manager for Linux clusters, is crucial for optimizing cluster performance and resource utilization. In this blog post, we’ll explore how to set up Slurm monitoring using Prometheus and Grafana, two popular open-source tools for monitoring and visualization.

# Preparing Slurm Exporter

To start monitoring Slurm with Prometheus, we first need to prepare the Slurm Exporter. Follow these steps:

sudo yum install -y golang 
git clone -b 0.20 https://github.com/vpenso/prometheus-slurm-exporter.git 
cd prometheus-slurm-exporter 
make && sudo cp bin/prometheus-slurm-exporter /usr/bin/
sudo tee /etc/systemd/system/prometheus-slurm-exporter.service > /dev/null << EOF
[Unit]
Description=Prometheus SLURM Exporter

[Service]
ExecStart=/usr/bin/prometheus-slurm-exporter
Restart=on-failure
RestartSec=15
Type=simple

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload 
sudo systemctl enable --now prometheus-slurm-exporter

This installs Go, clones the Slurm Exporter repository at tag 0.20, builds the binary, and registers it as a systemd service that starts on boot.

# Preparing Node Exporter

Next, we prepare the Node Exporter for system-level metrics:

wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
cd node_exporter-1.7.0.linux-amd64
sudo cp node_exporter /usr/bin/
sudo tee /etc/systemd/system/node-exporter.service > /dev/null << EOF
[Unit]
Description=Prometheus Node Exporter

[Service]
ExecStart=/usr/bin/node_exporter
Restart=on-failure
RestartSec=15
Type=simple

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload 
sudo systemctl enable --now node-exporter

This downloads and sets up the Node Exporter binary and creates a systemd service for automatic startup.
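Before moving on, it can help to confirm that both exporters actually serve metrics. The helper below is a minimal sketch, not part of either exporter: `count_metrics` is a hypothetical function name, and the ports in the usage comments (8080 for the Slurm Exporter, 9100 for the Node Exporter) assume each exporter runs with its default listen address.

```shell
# Count metric samples whose name starts with the given prefix.
# Reads Prometheus text-format metrics on stdin; comment lines
# (# HELP, # TYPE) are skipped before counting.
count_metrics() {
  grep -v '^#' | grep -c "^$1"
}

# Usage on the node running the exporters (assumed default ports):
#   curl -s http://localhost:8080/metrics | count_metrics slurm_
#   curl -s http://localhost:9100/metrics | count_metrics node_
```

A count of zero, or a connection error from `curl`, suggests the service is not running; `systemctl status` on the corresponding unit is the next place to look.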

# Building Docker Image

Building a Docker image for Prometheus involves creating a Dockerfile and a prometheus.yml file. Here’s a simple example Dockerfile:

FROM prom/prometheus
ADD prometheus.yml /etc/prometheus/
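The Dockerfile copies a prometheus.yml that has to exist next to it. As a rough sketch of what that file might contain, the snippet below writes a minimal configuration with one scrape job per exporter. The hostname `scheduler-host`, the job names, and the 30-second scrape interval are assumptions for illustration; the ports assume both exporters run with their default listen addresses (8080 for the Slurm Exporter, 9100 for the Node Exporter).

```shell
# Hypothetical hostname of the node running both exporters; override as needed.
SLURM_HOST="${SLURM_HOST:-scheduler-host}"

# Write a minimal prometheus.yml into the current directory (next to the Dockerfile).
cat > prometheus.yml << EOF
global:
  scrape_interval: 30s

scrape_configs:
  - job_name: slurm_exporter
    static_configs:
      - targets: ['${SLURM_HOST}:8080']   # assumed Slurm Exporter default port

  - job_name: node_exporter
    static_configs:
      - targets: ['${SLURM_HOST}:9100']   # assumed Node Exporter default port
EOF
```

The target address must be reachable from wherever the Prometheus container ends up running, so a private hostname only works if the container has network access to the cluster.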

Build the Docker image using:

docker build -t my-prometheus .

# Pushing Image to ACR

To push the Prometheus image, along with the stock Grafana image from Docker Hub, to Azure Container Registry (ACR), follow these steps:

az acr login --name prometheus20231215
docker tag my-prometheus:latest prometheus20231215.azurecr.io/my-prometheus:latest
docker push prometheus20231215.azurecr.io/my-prometheus:latest
docker pull grafana/grafana-oss:latest
docker tag grafana/grafana-oss:latest prometheus20231215.azurecr.io/grafana-oss:latest
docker push prometheus20231215.azurecr.io/grafana-oss:latest

# Configuring App Service for Container

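Configure Azure App Service for Containers to run the Prometheus and Grafana images pushed to ACR. This step involves setting the necessary app settings (environment variables) for each container and making sure the Prometheus app can reach the exporters on the cluster over the network.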
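One app setting that commonly matters for custom containers is `WEBSITES_PORT`, which tells App Service which container port to route traffic to. The sketch below only prints the `az` commands so they can be reviewed first. The resource group `my-monitoring-rg`, the app names `my-prometheus` and `my-grafana`, and the helper `port_setting_cmd` are all hypothetical; the ports are the Prometheus and Grafana defaults (9090 and 3000).

```shell
# Hypothetical resource group; replace with your own.
RESOURCE_GROUP="${RESOURCE_GROUP:-my-monitoring-rg}"

# Print the az command that sets WEBSITES_PORT for one web app.
# $1 = App Service app name, $2 = port the container listens on.
port_setting_cmd() {
  echo "az webapp config appsettings set" \
       "--resource-group ${RESOURCE_GROUP}" \
       "--name $1" \
       "--settings WEBSITES_PORT=$2"
}

port_setting_cmd my-prometheus 9090   # Prometheus default port
port_setting_cmd my-grafana 3000      # Grafana default port
```

Once the printed commands look right, they can be pasted into a shell that is logged in with `az login`.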

# Exploring Prometheus and Grafana Web UI

After completing the setup, explore the Prometheus and Grafana web UIs to access metrics, create dashboards, and gain insight into Slurm cluster performance. Together they give you a Slurm monitoring system that supports informed decisions about cluster resource management.

In Grafana, add Prometheus as a data source.

Add the Slurm dashboard.

Check the Prometheus Web UI.

Check the Grafana Web UI.
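The data source step above is done through the Grafana UI. As an alternative sketch, Grafana can also load data sources from a provisioning file at startup. The snippet below writes such a file; the URL `https://my-prometheus.azurewebsites.net` is a hypothetical App Service hostname and the file name `datasources.yml` is an arbitrary choice.

```shell
# Hypothetical URL of the Prometheus web app; override as needed.
PROMETHEUS_URL="${PROMETHEUS_URL:-https://my-prometheus.azurewebsites.net}"

# Write a Grafana data source provisioning file into the current directory.
cat > datasources.yml << EOF
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: ${PROMETHEUS_URL}
    isDefault: true
EOF
```

For Grafana to pick it up, the file would need to be placed under `/etc/grafana/provisioning/datasources/` inside the container, for example by building a small custom image on top of grafana/grafana-oss in the same way as the Prometheus image.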

# Reference

Sean Smith - GPU Monitoring with Grafana
Slurm dashboard
