# Slurm Monitoring with Prometheus and Grafana

Monitoring Slurm, a powerful workload manager for Linux clusters, is crucial for optimizing cluster performance and resource utilization. In this blog post, we’ll explore how to set up Slurm monitoring using Prometheus and Grafana, two popular open-source tools for monitoring and visualization.

# Preparing Slurm Exporter

To start monitoring Slurm with Prometheus, we first need to prepare the Slurm Exporter. Follow these steps:

sudo yum install -y golang 
git clone -b 0.20 https://github.com/vpenso/prometheus-slurm-exporter.git
cd prometheus-slurm-exporter
make && sudo cp bin/prometheus-slurm-exporter /usr/bin/
sudo tee /etc/systemd/system/prometheus-slurm-exporter.service > /dev/null << EOF
[Unit]
Description=Prometheus SLURM Exporter

[Service]
ExecStart=/usr/bin/prometheus-slurm-exporter
Restart=on-failure
RestartSec=15
Type=simple

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now prometheus-slurm-exporter

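This installs Go, clones the 0.20 release of the Slurm Exporter, builds it, and registers it as a systemd service. Run it on a node where the Slurm command-line tools (such as sinfo and squeue) work, since the exporter collects its metrics by calling them.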
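
Once the service is running, it helps to confirm that Slurm metrics are actually being exposed. The sketch below is one way to do that: a small helper (the name `count_metrics` is made up for this example) that reads Prometheus text-format output on stdin and counts the samples whose name starts with a given prefix.

```shell
# Count metric samples whose name starts with the given prefix.
# Reads Prometheus text-format metrics on stdin and skips
# comment lines such as "# HELP" and "# TYPE".
count_metrics() {
  grep -v '^#' | grep -c "^$1"
}
```

Assuming the exporter listens on its default port 8080, a quick check would be `curl -s http://localhost:8080/metrics | count_metrics slurm_`. A result of 0 suggests the exporter cannot reach the Slurm commands.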

# Preparing Node Exporter

Next, we prepare the Node Exporter for system-level metrics:

wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
cd node_exporter-1.7.0.linux-amd64
sudo cp node_exporter /usr/bin/
sudo tee /etc/systemd/system/node-exporter.service > /dev/null << EOF
[Unit]
Description=Prometheus Node Exporter

[Service]
ExecStart=/usr/bin/node_exporter
Restart=on-failure
RestartSec=15
Type=simple

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now node-exporter

This downloads and sets up the Node Exporter binary and creates a systemd service for automatic startup.
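
Because the binary is copied into /usr/bin and run as a service, it may be worth verifying the download first. The following is a minimal sketch, assuming the release page also publishes a `sha256sums.txt` file alongside the tarball; `verify_download` is a hypothetical helper name.

```shell
# Verify a downloaded file against a checksum list.
# $1 = file name as it appears in the list, $2 = checksum list file.
# Returns 0 only if the file is listed and its SHA-256 hash matches.
verify_download() {
  # Keep only the checksum line for our file, then let sha256sum check it.
  grep -F "  $1" "$2" | sha256sum --check --status -
}
```

Run it from the directory that holds the tarball, for example `verify_download node_exporter-1.7.0.linux-amd64.tar.gz sha256sums.txt && echo OK`, before extracting the archive.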

# Building Docker Image

Building a Docker image for Prometheus involves creating a Dockerfile and a prometheus.yml file. Here’s a simple example Dockerfile:

FROM prom/prometheus
COPY prometheus.yml /etc/prometheus/
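
The Dockerfile expects a prometheus.yml next to it, which is not shown here. The sketch below is one way to generate a minimal one. The function name, the host argument, and the 30s scrape interval are assumptions for illustration, and the ports are the exporters' defaults (8080 for the Slurm exporter, 9100 for the Node Exporter).

```shell
# Write a minimal prometheus.yml that scrapes both exporters on one host.
# $1 = host running the exporters (hypothetical), $2 = output path.
write_prometheus_config() {
  host="$1"
  out="${2:-prometheus.yml}"
  cat > "${out}" << EOF
global:
  scrape_interval: 30s

scrape_configs:
  - job_name: slurm_exporter
    static_configs:
      - targets: ['${host}:8080']   # assumed default port of prometheus-slurm-exporter
  - job_name: node_exporter
    static_configs:
      - targets: ['${host}:9100']   # default port of node_exporter
EOF
}
```

For example, `write_prometheus_config my-head-node` writes prometheus.yml into the current directory, where `my-head-node` stands in for an address that the Prometheus container can reach.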

Build the Docker image using:

docker build -t my-prometheus .

# Pushing Image to ACR

To deploy the Docker images to Azure Container Registry (ACR), follow these steps:

az acr login --name prometheus20231215 
docker tag my-prometheus:latest prometheus20231215.azurecr.io/my-prometheus:latest
docker push prometheus20231215.azurecr.io/my-prometheus:latest
docker pull grafana/grafana-oss:latest
docker tag grafana/grafana-oss:latest prometheus20231215.azurecr.io/grafana-oss:latest
docker push prometheus20231215.azurecr.io/grafana-oss:latest

Log in to ACR, tag the images, and push them to the registry.
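
The tag-and-push pattern repeats for each image, so it can be wrapped in a small helper. This is a sketch only: `acr_ref` and `push_to_acr` are hypothetical names, and the images are assumed to exist locally already.

```shell
REGISTRY="prometheus20231215"   # registry name used in the commands above

# Map an image reference to its name inside the registry,
# dropping any namespace prefix such as "grafana/".
acr_ref() {
  printf '%s.azurecr.io/%s\n' "$1" "${2##*/}"
}

# Tag and push every image given as an argument; stop at the first failure.
push_to_acr() {
  for image in "$@"; do
    target="$(acr_ref "${REGISTRY}" "${image}")"
    docker tag "${image}" "${target}" && docker push "${target}" || return 1
  done
}
```

After `az acr login --name "${REGISTRY}"`, calling `push_to_acr my-prometheus:latest grafana/grafana-oss:latest` performs the same four tag and push commands.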

# Configuring App Service for Container

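Configure Azure App Service for Containers to run the Prometheus and Grafana images from ACR. This involves setting the environment variables (app settings) each container needs and making sure Prometheus can reach the exporter ports on the cluster nodes, by default 8080 for the Slurm exporter and 9100 for the Node Exporter.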
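
One environment variable that custom containers on App Service commonly need is `WEBSITES_PORT`, which tells the platform which container port to route traffic to. The sketch below sets it with the Azure CLI. The resource group name and app names are placeholders, and the `run` wrapper exists only so the command can be previewed with `DRY_RUN=1`.

```shell
# Hypothetical resource group; replace with your own.
RESOURCE_GROUP="slurm-monitoring-rg"

# Print the command when DRY_RUN is set, otherwise execute it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

# Tell App Service which container port to route HTTP traffic to.
# $1 = web app name, $2 = port the container listens on.
set_container_port() {
  run az webapp config appsettings set \
    --resource-group "${RESOURCE_GROUP}" \
    --name "$1" \
    --settings "WEBSITES_PORT=$2"
}
```

Prometheus listens on 9090 and Grafana on 3000 by default, so the calls would look like `set_container_port my-prometheus-app 9090` and `set_container_port my-grafana-app 3000`, with both app names being placeholders.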


# Exploring Prometheus and Grafana Web UI

These steps give you a working Slurm monitoring stack. What remains is to connect Grafana to Prometheus and import a Slurm dashboard, so that you can make informed decisions about cluster resource management.

  1. Add Prometheus as a data source


  2. Add the Slurm dashboard


After completing the setup, explore the Prometheus and Grafana Web UIs. Access metrics, create dashboards, and gain insights into Slurm cluster performance.

  1. Check the Prometheus Web UI


  2. Check the Grafana Web UI


