# Slurm Monitoring with Prometheus and Grafana
Monitoring Slurm, a powerful workload manager for Linux clusters, is crucial for optimizing cluster performance and resource utilization. In this blog post, we’ll explore how to set up Slurm monitoring using Prometheus and Grafana, two popular open-source tools for monitoring and visualization.
# Preparing Slurm Exporter
To start monitoring Slurm with Prometheus, we first need to prepare the Slurm Exporter. Follow these steps:

    sudo yum install -y golang

The first command installs Golang; the remaining steps clone the Slurm Exporter repository, build the exporter, and set it up as a systemd service.
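The steps described above could be scripted roughly as follows. This is a sketch, not the original command sequence: the repository URL (the community `vpenso/prometheus-slurm-exporter` project), the install path `/usr/local/bin`, and the unit name are all assumptions, and the build command may differ between exporter versions, so check the project's README first.

```shell
# Assumed values, not taken from the original post.
REPO_URL="https://github.com/vpenso/prometheus-slurm-exporter.git"
INSTALL_DIR="/usr/local/bin"

# Print a minimal systemd unit for the exporter to stdout.
# $1 = absolute path of the exporter binary
slurm_exporter_unit() {
  cat <<EOF
[Unit]
Description=Prometheus Slurm Exporter
After=network.target

[Service]
ExecStart=$1
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
}

# Install steps. Run on a node where sinfo and squeue are on the PATH,
# because the exporter collects its metrics by calling the Slurm CLI tools.
install_slurm_exporter() (
  set -eu
  sudo yum install -y golang git
  git clone "$REPO_URL"
  # Assumes a plain "go build" works for the checked-out version.
  (cd prometheus-slurm-exporter && go build -o prometheus-slurm-exporter .)
  sudo install -m 0755 prometheus-slurm-exporter/prometheus-slurm-exporter "$INSTALL_DIR/"
  slurm_exporter_unit "$INSTALL_DIR/prometheus-slurm-exporter" |
    sudo tee /etc/systemd/system/prometheus-slurm-exporter.service >/dev/null
  sudo systemctl daemon-reload
  sudo systemctl enable --now prometheus-slurm-exporter
)
```

Calling `install_slurm_exporter` performs the whole sequence; `slurm_exporter_unit` can also be run on its own to review the unit file before anything is written to `/etc/systemd/system`.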
# Preparing Node Exporter
Next, we prepare the Node Exporter for system-level metrics:

    wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz

The command above downloads the Node Exporter 1.7.0 release archive; the remaining steps unpack the binary and create a systemd service for automatic startup.
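A possible way to script the download, unpacking, and service setup is sketched below. The install path, the dedicated `node_exporter` user, and the unit contents are assumptions chosen for illustration; only the download URL pattern comes from the command above.

```shell
NODE_EXPORTER_VERSION="1.7.0"

# Build the release URL for a given version.
# $1 = version, e.g. 1.7.0
node_exporter_url() {
  echo "https://github.com/prometheus/node_exporter/releases/download/v$1/node_exporter-$1.linux-amd64.tar.gz"
}

# Print a minimal systemd unit for Node Exporter to stdout.
node_exporter_unit() {
  cat <<EOF
[Unit]
Description=Prometheus Node Exporter
After=network.target

[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
}

# Download, unpack, and register the service (assumed layout).
install_node_exporter() (
  set -eu
  archive="node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64"
  wget "$(node_exporter_url "$NODE_EXPORTER_VERSION")"
  tar xzf "${archive}.tar.gz"
  sudo install -m 0755 "${archive}/node_exporter" /usr/local/bin/
  # Create an unprivileged service account if it does not exist yet.
  id node_exporter >/dev/null 2>&1 ||
    sudo useradd --system --no-create-home --shell /sbin/nologin node_exporter
  node_exporter_unit | sudo tee /etc/systemd/system/node_exporter.service >/dev/null
  sudo systemctl daemon-reload
  sudo systemctl enable --now node_exporter
)
```

Once the service is running, `curl -s http://localhost:9100/metrics | head` should print the first metric lines, since 9100 is Node Exporter's default port.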
# Building Docker Image
Building a Docker image for Prometheus involves creating a Dockerfile and a prometheus.yml file. Here’s a simple example Dockerfile:

    FROM prom/prometheus

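The `prometheus.yml` file mentioned above is not shown, so the following is only a sketch of what it might contain, generated from a small shell helper. The host names, the 30-second scrape interval, and the job names are hypothetical. Port 9100 is Node Exporter's default; port 8080 is assumed to be the Slurm Exporter's listen port and should be checked against your exporter. The sketch also assumes the Dockerfile copies the file to `/etc/prometheus/prometheus.yml`, the default config path in the `prom/prometheus` image.

```shell
# Print a minimal Prometheus configuration with one scrape job per exporter.
# $1 = host:port of the Slurm Exporter, $2 = host:port of the Node Exporter
prometheus_yml() {
  cat <<EOF
global:
  scrape_interval: 30s

scrape_configs:
  - job_name: slurm
    static_configs:
      - targets: ['$1']
  - job_name: node
    static_configs:
      - targets: ['$2']
EOF
}
```

For example, `prometheus_yml slurm-head:8080 slurm-head:9100 > prometheus.yml` writes the file next to the Dockerfile, where `slurm-head` is a placeholder for a host that the Prometheus container can reach.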
Build the Docker image using:

    docker build -t my-prometheus .

# Pushing Image to ACR
To deploy the Docker images to Azure Container Registry (ACR), follow these steps:

    az acr login --name prometheus20231215

Log in to ACR, tag the images, and push them to the registry.
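The login, tag, and push sequence could look like the sketch below. The helper names are made up for illustration, and the `latest` tag in the usage example is an assumption; the registry name and the `my-prometheus` image come from the commands above. An ACR login server has the form `<registry>.azurecr.io`.

```shell
# Build the fully qualified image reference for an ACR registry.
# $1 = registry name, $2 = image name, $3 = tag
acr_image_ref() {
  echo "$1.azurecr.io/$2:$3"
}

# Log in, tag the local image for the registry, and push it.
# $1 = registry name, $2 = local image name, $3 = tag
push_to_acr() (
  set -eu
  ref="$(acr_image_ref "$1" "$2" "$3")"
  az acr login --name "$1"
  docker tag "$2" "$ref"
  docker push "$ref"
)
```

With these helpers, `push_to_acr prometheus20231215 my-prometheus latest` pushes the Prometheus image; a Grafana image would be pushed the same way under its own name.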
# Configuring App Service for Container
Configure Azure App Service for Containers to run Prometheus and Grafana. This step involves setting the environment variables each container needs and making sure Prometheus can reach the exporters running on the cluster nodes.
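One environment variable that custom containers on App Service commonly need is `WEBSITES_PORT`, which tells the platform which port the container listens on. The sketch below sets it with the Azure CLI; the resource group and app names are placeholders, and the `DRY_RUN` switch is a convenience added here for previewing the command, not an Azure feature.

```shell
# Print the command when DRY_RUN is set, otherwise execute it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Tell App Service which port the container listens on.
# $1 = resource group, $2 = web app name, $3 = container port
set_container_port() {
  run az webapp config appsettings set \
    --resource-group "$1" --name "$2" \
    --settings "WEBSITES_PORT=$3"
}
```

Prometheus listens on port 9090 and Grafana on port 3000 by default, so a typical use would be `set_container_port my-rg my-prometheus-app 9090` followed by `set_container_port my-rg my-grafana-app 3000`. Setting `DRY_RUN=1` beforehand prints the `az` commands instead of running them.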
# Exploring Prometheus and Grafana Web UI
After completing the setup, explore the Prometheus and Grafana Web UIs. Access metrics, create dashboards, and gain insights into Slurm cluster performance.
- Check the Prometheus Web UI
- Check the Grafana Web UI
- Add Prometheus as a data source
- Add the Slurm dashboard

By following these steps, you can establish a robust Slurm monitoring system using Prometheus and Grafana, empowering you to make informed decisions about cluster resource management.
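The checks and the data source step in this section can also be done from the command line, which is handy for confirming that both containers are up before opening a browser. The sketch below uses Prometheus's `/-/healthy` endpoint, Grafana's `/api/health` endpoint, and Grafana's `POST /api/datasources` API. The function names, the data source name, and the use of basic authentication are assumptions for illustration.

```shell
# JSON body for Grafana's POST /api/datasources endpoint.
# $1 = Prometheus base URL
datasource_payload() {
  printf '{"name":"Prometheus","type":"prometheus","access":"proxy","url":"%s","isDefault":true}\n' "$1"
}

# Succeeds if Prometheus reports itself healthy.
# $1 = Prometheus base URL
check_prometheus() {
  curl -fsS "$1/-/healthy"
}

# Succeeds if Grafana answers its health endpoint.
# $1 = Grafana base URL
check_grafana() {
  curl -fsS "$1/api/health"
}

# Register Prometheus as a Grafana data source.
# $1 = Grafana base URL, $2 = Prometheus base URL, $3 = user:password
add_prometheus_datasource() {
  datasource_payload "$2" |
    curl -fsS -u "$3" -H 'Content-Type: application/json' \
      -X POST --data-binary @- "$1/api/datasources"
}
```

To see whether the Slurm and node scrape jobs are actually being collected, `curl -s "$PROMETHEUS_URL/api/v1/targets"` lists every target with its health state, where `PROMETHEUS_URL` is the base URL of your Prometheus app.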
# Reference
- Sean Smith - GPU Monitoring with Grafana
- Slurm dashboard