SLURM Workload Manager

Slurm

Slurm (also referred to as the Slurm Workload Manager) is an open-source workload manager designed for Linux clusters of all sizes, and it is used by many of the world's supercomputers and computer clusters.

The cluster uses Slurm as a batch system, which provides a job scheduler and a resource manager within a single product. When users submit jobs to the cluster, a mechanism is needed to distribute those jobs across the available nodes. This is the responsibility of the job scheduler, which makes its decisions based on policies, priorities and resource requirements. For the scheduler to decide properly, it needs a complete picture of the resources available on each node at any given time; that information comes from another entity, the resource manager, which handles low-level tasks such as starting, holding, cancelling and monitoring jobs. In Slurm these two roles are integrated: the resource manager is built into the job scheduler, and the two operate as a single entity.

In short, Slurm performs the following two key tasks:

a. It allocates resources (compute nodes) to users for a period of time and, if needed, eases contention by queuing pending work. Access can be exclusive or non-exclusive: several jobs can share the same nodes at the same time, or the admin can define policies so that at any time only one job has access to the resources.

b. It can start, stop and, in particular, monitor the jobs on a set of allocated nodes.

The best way to install Slurm is to compile it from source, install the binaries and libraries in a shared directory, and export that directory through NFS to all nodes (compute nodes, login node and Master node).

The above Figure shows the simple setup: the slurmctld daemon is installed on the Master node and the slurmd daemon is installed on the compute nodes that run the jobs. Note that if the cluster also has a login node, no daemon needs to be installed on it; we only need to share the /opt/software/slurm directory and load the environment into the shell via Module. If there is a need for accounting, we can also install the slurmdbd daemon with a proper database, as I explain later.

Configuration

The main idea is to install Slurm in a directory that can be shared among all nodes (Master node, login node and compute nodes), so I will not install Slurm separately on each compute node or on the login node. The directory I use for this purpose is /opt/software/slurm.

Important: before starting with the Slurm installation and configuration, please make sure that the time is fully synchronized between the Master node and the compute nodes. You can follow my thread on NTP configuration in the Linux part.

a. Slurm installation without Accounting

The first step is to install munge for authentication.

Munge

"munge is an authentication service for creating and validating credentials. It is designed to be highly scalable for use in an HPC cluster environment. It allows a process to authenticate the UID and GID of another local or remote process within a group of hosts having common users and groups. These hosts form a security realm that is defined by a shared cryptographic key. Clients within this security realm can create and validate credentials without the use of root privileges, reserved ports, or platform-specific methods." (taken from https://dun.github.io/munge/).

1. On the Master node:

[root@slurm-master ~]# yum install munge*

2. On the compute nodes (or in the image):

[root@qingcl-master ~]# yum --installroot=/install/netboot/centos7.2/x86_64/compute/rootimg/ install munge.x86_64

Important: the installation of munge automatically adds a munge user to the system.

3. Create the key for munge authentication:

[root@slurm-master ~]# dd if=/dev/random bs=1 count=32 > /etc/munge/munge.key

Also change the ownership and permissions of the key:

chown munge:munge /etc/munge/munge.key

chmod 400 /etc/munge/munge.key (read by owner only)

Then copy it to the image (compute nodes) as well, making sure the permissions and ownership stay the same (that is why I used rsync):

[root@slurm-master]# rsync -av munge.key /install/netboot/centos7.2/x86_64/compute/rootimg/etc/munge/

[root@slurm-master ~]# ll /etc/munge/munge.key

-r-------- 1 munge munge 10 Jul 18 16:50 /etc/munge/munge.key
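As a quick sanity check of the mode above, this sketch reproduces the 400 permission on a scratch file, so it can be tried anywhere without touching the real key:

```shell
# chmod 400 = read-only for the owner; demonstrated on a temporary file.
# On the master you would instead run:  stat -c '%U:%G %a' /etc/munge/munge.key
key=$(mktemp)          # stand-in for /etc/munge/munge.key
chmod 400 "$key"
stat -c '%a' "$key"    # prints: 400
rm -f "$key"
```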

We need to make sure that the munge user has the same uid/gid in /etc/passwd and /etc/group on all nodes (Master node, compute nodes, ...); otherwise we can simply fix the entries with an editor.

In our case, which has only a Master node and compute nodes:

master: munge:x:985:982:Runs Uid 'N' Gid Emporium:/var/run/munge:/sbin/nologin

image: munge:x:985:982:Runs Uid 'N' Gid Emporium:/var/run/munge:/sbin/nologin
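This comparison can be scripted. The helper below pulls the uid:gid pair for a user out of any passwd-format file; on a live system you would point it at /etc/passwd and at the image's etc/passwd and compare the two results (the entry used in the demo is just the line shown above):

```shell
# get_ids FILE USER -> prints "uid:gid" for USER from a passwd-format FILE
get_ids() { awk -F: -v u="$2" '$1 == u { print $3 ":" $4 }' "$1"; }

# Demo on a sample entry; on the master, compare
#   get_ids /etc/passwd munge
# against the same call on the image's etc/passwd.
sample=$(mktemp)
echo 'munge:x:985:982::/var/run/munge:/sbin/nologin' > "$sample"
get_ids "$sample" munge    # prints: 985:982
rm -f "$sample"
```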

Slurm

First we need to download the latest stable version of Slurm from the website (http://www.schedmd.com/). We download version slurm-16.05.2 and move it to the Master node.

Then we untar it and run the following (I specify the directory where I want Slurm installed, since this directory is mounted on all nodes):

./configure --enable-debug --prefix=/opt/software/slurm/16.05.2
make
make install

You can see the complete list of options with ./configure --help if you need to set more of them.

Then we have to create two systemd unit files: one for the controller on the Master node and one for the compute nodes. The Slurm installation ships the needed files (slurmctld.service and slurmd.service), which in our case are stored in the following directory and only need to be copied into /etc/systemd/system on the Master node and the compute nodes.

/install/software/src/slurm/slurm-16.05.2/etc/

We copy slurmctld.service into the systemd directory on our Master server:

[root@qingcl-master etc]# cp /install/software/src/slurm/slurm-16.05.2/etc/slurmctld.service /etc/systemd/system/slurmctld.service

and we do the same for the compute nodes (image) by copying slurmd.service into the /etc/systemd/system directory.

Important: arguably the cleaner way is to copy the above files into /usr/lib/systemd/system/ on the Master node and compute nodes. Then, when the service is enabled (systemctl enable slurmctld/slurmd), the symlink is created automatically in the /etc/systemd/... directory.

[root@slurm-master etc]# cat /etc/systemd/system/slurmctld.service

[Unit]
Description=Slurm controller daemon
After=network.target
ConditionPathExists=/opt/software/slurm/16.05.2/etc/slurm.conf

[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/slurmctld
ExecStart=/opt/software/slurm/16.05.2/sbin/slurmctld $SLURMCTLD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurmctld.pid

[Install]
WantedBy=multi-user.target

As can be seen in the file above, we have to correct the path referring to slurm.conf. The PIDFile is also very important and must point to the same path in slurm.conf as in the systemd unit files.

Some systemd services have a configuration file that can be used for several purposes, such as troubleshooting and debugging. Here we created the following configuration file:

[root@slurm-master etc]# cat /etc/sysconfig/slurmctld

SLURMCTLD_OPTIONS="-v -L /var/log/slurmctld.log -f /opt/software/slurm/16.05.2/etc/slurm.conf"

So, thanks to the EnvironmentFile entries in the unit files, when we start these services and run into problems we can adjust the options in the files under /etc/sysconfig and then read the resulting log files.

However, you have to wait until we define the slurm.conf file. For troubleshooting, the daemon can also be started by hand:

[root@slurm-master ~]# /opt/software/slurm/16.05.2/sbin/slurmctld -L /var/log/slurmctld.log -f /opt/software/slurm/16.05.2/etc/slurm.conf

Any problem will show up here and can be solved. Then we can restart the service with systemctl restart slurmctld.service.

Now in the image (compute nodes)

As above, we first create a configuration file for slurmd:

[root@node-01 /]# cat /etc/sysconfig/slurmd

SLURMD_OPTIONS="-f /opt/software/slurm/16.05.2/etc/slurm.conf"

[root@node-01 ~]# cat /etc/systemd/system/slurmd.service

[Unit]
Description=Slurm node daemon
After=network.target
ConditionPathExists=/opt/software/slurm/16.05.2/etc/slurm.conf

[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/opt/software/slurm/16.05.2/sbin/slurmd $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurmd.pid
KillMode=process
LimitNOFILE=51200
LimitMEMLOCK=infinity
LimitSTACK=infinity

[Install]
WantedBy=multi-user.target

 

Important: after running the above commands by hand for troubleshooting, we need to kill the corresponding processes before we can run them again (check with ps aux | grep -i slurm); otherwise it is very likely that we get error messages like these:

[2016-12-04T12:31:20.148] error: Error binding slurm stream socket: Address already in use
[2016-12-04T12:31:20.148] error: Unable to bind listen port (*:6818): Address already in use
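To find the stale process, you can check which PID still owns the port. The snippet below parses `ss -tlnp`-style output; a sample line is inlined so the sketch runs anywhere, and on a real node you would pipe live `ss -tlnp` output into the same awk instead:

```shell
# Extract the PID bound to the slurmd port (6818) from `ss -tlnp`-style output.
# `sample` stands in for live output here.
sample='LISTEN 0 128 *:6818 *:* users:(("slurmd",pid=2101,fd=5))'
echo "$sample" | awk '$4 ~ /:6818$/ {
    match($0, /pid=[0-9]+/)
    print substr($0, RSTART + 4, RLENGTH - 4)    # prints: 2101
}'
# Then kill that PID (or simply `pkill slurmd`) before re-running the daemon.
```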

 

slurm.conf file

I would say this is one of the most important files in Slurm and it needs special care. There are several slurm.conf generator tools and websites that can ease this process; one you can use is the following: https://slurm.schedmd.com/configurator.easy.html

However, here I provide a simple and efficient one that you can copy and paste, modifying only a few important parts.

[root@slurm-master ~]# cat /opt/software/slurm/16.05.2/etc/slurm.conf

ClusterName=hrouhani
ControlMachine=slurm-master
ControlAddr=192.168.1.21
AuthType=auth/munge
CacheGroups=0
CryptoType=crypto/munge
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/tmp/slurmd
SlurmUser=slurm
StateSaveLocation=/tmp
SwitchType=switch/none
TaskPlugin=task/none
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/linear
# JOB PRIORITY
# LOGGING AND ACCOUNTING
SlurmctldDebug=3
SlurmdDebug=3
# COMPUTE NODES
NodeName=node-[01-04] RealMemory=64314 Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 State=UNKNOWN
PartitionName=debug Nodes=node-[01-04] Default=YES MaxTime=INFINITE State=UP

 

Important notes:

a. Make sure that the time is fully synchronized between the compute nodes and the Master node. I once had a problem where the timezone differed because /usr/share/zoneinfo was excluded in the xCAT osimage definition.

b. The slurm.conf file is very sensitive; the resource definitions need to be exactly correct. To check the correct resource definition on a node:

/opt/software/…./slurmd -C

[root@node-01 ~]# /opt/software/slurm/sbin/slurmd -C
NodeName=node-11 CPUs=56 Boards=1 SocketsPerBoard=2 CoresPerSocket=14 ThreadsPerCore=2 RealMemory=257723 TmpDisk=128861

Based on the above result, I configured my slurm.conf as follows:

# COMPUTE NODES
NodeName=node-[01-36] CPUs=56 SocketsPerBoard=2 CoresPerSocket=14 ThreadsPerCore=2 State=UNKNOWN
PartitionName=debug Nodes=node-[01-36] Default=YES MaxTime=INFINITE State=UP
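The translation from the `slurmd -C` line to the slurm.conf entry can be scripted; this sketch simply picks out the four topology fields and substitutes the full node range for the single probed node's name:

```shell
# Turn one `slurmd -C` output line into the slurm.conf node definition,
# keeping CPUs, SocketsPerBoard, CoresPerSocket and ThreadsPerCore.
line='NodeName=node-11 CPUs=56 Boards=1 SocketsPerBoard=2 CoresPerSocket=14 ThreadsPerCore=2 RealMemory=257723 TmpDisk=128861'
echo "$line" | awk '{ printf "NodeName=node-[01-36] %s %s %s %s State=UNKNOWN\n", $2, $4, $5, $6 }'
# prints: NodeName=node-[01-36] CPUs=56 SocketsPerBoard=2 CoresPerSocket=14 ThreadsPerCore=2 State=UNKNOWN
```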

As can be seen, hyperthreading is on and we have 56 logical CPUs in total on this 2-socket system.
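The topology arithmetic is worth double-checking, since a CPU count that does not match Sockets × CoresPerSocket × ThreadsPerCore will typically leave the node in an invalid state:

```shell
# 2 sockets x 14 cores/socket x 2 threads/core = 56 logical CPUs
sockets=2 cores_per_socket=14 threads_per_core=2
echo $(( sockets * cores_per_socket * threads_per_core ))    # prints: 56
```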

c. In my configuration SlurmUser is slurm; you can also change it to root. If the slurm user was not created during the Slurm installation on the Master node, please create it manually (adduser slurm) and copy the slurm lines from /etc/passwd and /etc/group into the compute node image, so that all nodes share the same UID/GID.

d. The parts of the slurm.conf file that are specific to this setup (cluster name, control machine and address, node and partition definitions) need to be adjusted to your environment.

 

Running the services

a. The first step is to get the munge service running on both the Master node and all compute nodes. Make sure the time is fully synchronized, otherwise it will fail. Please read the thread I wrote about NTP if you have problems.

b. We start the munge service on the Master node.

c. Then we can start slurmctld on the Master node. If you run into problems, simply read the log files.

d. We start the munge service on the compute nodes.

e. Then we start slurmd on all compute nodes.

f. Finally, you can use some Slurm commands to check the situation. sinfo is the main one; I suggest running it on the Master node to monitor the status of all nodes.

root@slurm-master ~ # sinfo
PARTITION  AVAIL   TIMELIMIT  NODES  STATE  NODELIST
debug      up     13-21:20:0      8  idle   node[01-08]

The state is idle since no job is running on the nodes. If the state is down, something is wrong and you need to check what the situation is.

If you run a job on some nodes, their state changes to "alloc", which basically means the node(s) are running job(s).

I have seen nodes end up in the down state for various reasons; the following command simply puts a node back into the idle state:

[root@slurm-master ~]# scontrol update NodeName=node-01 State=RESUME
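If several nodes are down at once, the same update can be looped over sinfo's output. The sketch below only echoes the commands it would run and uses a sample node list, so it can be dry-run anywhere; on the master, replace the sample with `sinfo -h -N -t down -o '%N'` and drop the leading echo:

```shell
# Build one RESUME command per down node.
# `down_nodes` stands in for:  sinfo -h -N -t down -o '%N'
down_nodes='node-03
node-07'
echo "$down_nodes" | sort -u | while read -r n; do
    echo scontrol update NodeName="$n" State=RESUME    # drop `echo` to execute
done
```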

Important: you might wonder why you don't have Slurm commands like sinfo, scontrol and the others. The reason is that Slurm is not in the PATH of your environment. I use module files to put all the necessary Slurm directories in my PATH beforehand; please have a look at the Module files thread for more information.
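Until the module file is in place, you can extend PATH by hand; the prefix below is the install path used in this guide, so adjust it to your own:

```shell
# Make sinfo/scontrol (bin) and the daemons (sbin) reachable from the shell.
SLURM_HOME=/opt/software/slurm/16.05.2
export PATH=$SLURM_HOME/bin:$SLURM_HOME/sbin:$PATH
```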

 

**********************************************************************************************************************************************************************

 

b. Slurm installation with Accounting

The munge step is exactly the same as above, so I will not go through it again here.

Slurm can be configured to collect accounting information for every job and job step executed. Accounting records can be written to a simple text file or to a database. Information is available both for currently executing jobs and for jobs that have already terminated.

Storing accounting data in text files is easy: just specify the text file as the accounting storage type. However, such a file grows quickly, becomes too large and gets hard to handle. Therefore, the easiest and recommended way is to use a proper database for the accounting information; MySQL is the only fully supported option for this purpose.

There are two ways of using a database to store the accounting information:

  • Storing the data directly into a database
  • Storing the data indirectly into a database

Storing the data directly in a database adds some complexity, since the username/password must be available to both:

  • Slurm control daemon (slurmctld)
  • User commands which need to access the data (sacct, sreport, and sacctmgr)

 

This is a security problem, since all users would then have access to these credentials. Therefore, the recommended way is the second option: storing the data indirectly, through an intermediate daemon, into the database, which gives better security and performance. SlurmDBD (Slurm Database Daemon) provides exactly this service. SlurmDBD is written in C, multi-threaded, secure and fast.


In this case the main Slurm daemon (slurmctld) works closely with slurmdbd, and as a result slurmdbd needs to be up before slurmctld comes up.

Before building and installing Slurm we need to satisfy some prerequisites. The first decision is which database to use for storing the data; the preferred ones are MySQL and MariaDB. The two are very similar, so I use MariaDB.

[root@slurm-master ~]# yum install mariadb mariadb-devel mariadb-server

After that, we should have:

[root@slurm-master ~]# yum list installed | grep -i mariadb

mariadb.x86_64 1:5.5.50-1.el7_2 @updates

mariadb-devel.x86_64 1:5.5.50-1.el7_2 @updates

mariadb-libs.x86_64 1:5.5.50-1.el7_2 @updates

mariadb-server.x86_64 1:5.5.50-1.el7_2 @updates

This installs a database (version 5.5.50) that is very similar to MySQL and uses the same commands. The installation also provides a file called mysql_config, which holds all the information about the installed database and gives us what we need to connect Slurm to MariaDB. In our case it is located at /usr/bin/mysql_config, and I use this path when configuring the Slurm build. So now I configure Slurm and then install it:

./configure --prefix=/opt/software/slurm/16.05.4 --with-munge --with-mysql_config=/usr/bin/mysql_config
make
make install

The first step is to configure the database. For this purpose, we first start MariaDB:

[root@slurm-master ~]# systemctl start mariadb.service

Then we configure our MariaDB database. We need to create the database ourselves, but Slurm will create the appropriate tables automatically, so we don't need to do that. Therefore we need to give the slurm user the proper permissions so that it can create the tables.

[root@slurm-master ~]# mysql

Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 3
Server version: 5.5.50-MariaDB MariaDB Server

If we want to allow Slurm to create databases itself for any future needs, we can grant all on *.* instead of slurm_acct_db.*. To be sure everything works properly, I grant the permission using both localhost and the system's hostname.

MariaDB [(none)]> grant all on slurm_acct_db.* to 'slurm'@'localhost';

Query OK, 0 rows affected (0.00 sec)

 

MariaDB [(none)]> grant all on slurm_acct_db.* to 'slurm'@'jameson';

Query OK, 0 rows affected (0.00 sec)

 

I also give the slurm user a password ("hossein!") for accessing the database; as we will see later, I specify the same password in the slurmdbd.conf file.

MariaDB [(none)]> grant all on slurm_acct_db.* to 'slurm'@'localhost' identified by 'hossein!' with grant option;

Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> grant all on slurm_acct_db.* to 'slurm'@'jameson' identified by 'transtec!' with grant option;

Query OK, 0 rows affected (0.01 sec)

 

Slurm uses the InnoDB storage engine to make rollback possible, and it must be available in the MariaDB installation, otherwise rollbacks simply will not work. We confirm it with the following command:

MariaDB [(none)]> SHOW VARIABLES LIKE 'have_innodb';

+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| have_innodb   | YES   |
+---------------+-------+
1 row in set (0.00 sec)

MariaDB [(none)]> create database slurm_acct_db;

Query OK, 1 row affected (0.00 sec)

 

The next step is to create the proper systemd unit files. We can simply copy the files that come with the Slurm package, just as we did for the installation without accounting:

cp /opt/software/src/slurm/slurm-16.05.4/etc/slurmctld.service /etc/systemd/system
cp /opt/software/src/slurm/slurm-16.05.4/etc/slurmdbd.service /etc/systemd/system
cp /opt/software/src/slurm/slurm-16.05.4/etc/slurmd.service /computeNodes/etc/systemd/system

 

Note: the plan is that the Master node hosts both the slurmctld and slurmdbd daemons, and the compute nodes run slurmd.

Then we need to modify these unit files to refer to the right paths of the slurm.conf and slurmdbd.conf files. First we create the directory where we will keep slurm.conf and slurmdbd.conf:

[root@slurm-master]# mkdir -p /opt/software/slurm/16.05.4/etc

 

and make sure the ConditionPathExists entries in slurmctld.service and slurmdbd.service point at the same paths as our slurm.conf and slurmdbd.conf. In my case:

 

[root@slurm-master~]# cat /etc/systemd/system/slurmctld.service

[Unit]
Description=Slurm controller daemon
After=network.target
ConditionPathExists=/opt/software/slurm/16.05.4/etc/slurm.conf

[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/slurmctld
ExecStart=/opt/software/slurm/16.05.4/sbin/slurmctld $SLURMCTLD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurmctld.pid

[Install]
WantedBy=multi-user.target

 

[root@slurm-master ~]# cat /etc/systemd/system/slurmdbd.service

[Unit]
Description=Slurm DBD accounting daemon
After=network.target
ConditionPathExists=/opt/software/slurm/16.05.4/etc/slurmdbd.conf

[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/slurmdbd
ExecStart=/opt/software/slurm/16.05.4/sbin/slurmdbd $SLURMDBD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurmdbd.pid

[Install]
WantedBy=multi-user.target

 

The EnvironmentFile refers to a file that can be used to make the services write a log file when they run, which helps with troubleshooting.

[root@slurm-master ]# cat /etc/sysconfig/slurmctld

SLURMCTLD_OPTIONS=” -v -L /var/log/slurmctld.log -f /opt/software/slurm/16.05.4/etc/slurm.conf”

Since the above file (/etc/sysconfig/slurmctld) is referenced in the slurmctld.service unit, restarting the service (systemctl restart slurmctld) automatically writes the logs to /var/log/slurmctld.log, where we can read them. We can also do it directly, without restarting the service, like this:

 [root@slurm-master ~]# /opt/software/slurm/16.05.4/sbin/slurmctld -L /var/log/slurmctld.log -f /opt/software/slurm/16.05.4/etc/slurm.conf

[root@slurm-master ~]# cat /etc/sysconfig/slurmdbd

SLURMDBD_OPTIONS="-L /var/log/slurmdbd.log -f /opt/software/slurm/16.05.4/etc/slurmdbd.conf"

 

We do the same for the compute nodes (in our case, the image). First copy the slurmd.service unit file into the image:

[root@slurm-master ~]# cp /etc/systemd/system/slurmd.service /install/netboot/centos7.2/x86_64/compute/rootimg/etc/systemd/system/

and then create the file /etc/sysconfig/slurmd inside the image, containing, as before, the following line:

[root@slurm-master /]# cat /etc/sysconfig/slurmd

SLURMD_OPTIONS="-v -L /var/log/slurmd.log -f /opt/software/slurm/16.05.4/etc/slurm.conf"

We should not forget that the /opt/software directory must be mounted on all compute nodes. Next we make sure the slurm user has the same uid/gid on the Master node and in the compute image (and therefore on all compute nodes).

 [root@slurm-master ~]# cat /etc/passwd | grep -i slurm

slurm:x:982:982::/var/lib/slurm:/bin/bash

[root@slurm-master ~]# cat /etc/group | grep -i slurm

slurm:x:982:

It is also possible to run everything as root, but the preferred user is slurm, kept separate from root.

Now let's focus on the two configuration files we need for our setup. I simply include a simple example of each here that works well; you can add to or modify them based on your needs, optionally with more advanced options.

[root@slurm-master ~]# cat /opt/software/slurm/16.05.4/etc/slurm.conf

ControlAddr=10.0.0.254
ClusterName=hrouhani
AuthType=auth/munge
CryptoType=crypto/munge
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm/slurmd
SlurmUser=slurm
StateSaveLocation=/tmp
SwitchType=switch/none
TaskPlugin=task/none
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=1
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/linear
MaxArraySize=100000
# ACCOUNTING
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=slurm-master
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
CpuFreqGovernors=OnDemand,Performance,UserSpace
CpuFreqDef=Performance
# COMPUTE NODES
NodeName=node-[01-08] Sockets=2 CoresPerSocket=12 CPUs=48 ThreadsPerCore=2 State=UNKNOWN
PartitionName=debug Nodes=node-[01-08] Default=YES MaxTime=24:00:00 State=UP

 

[root@slurm-master ~]# cat /opt/software/slurm/16.05.4/etc/slurmdbd.conf

# Authentication info
AuthType=auth/munge
# slurmDBD info
#DbdAddr=localhost
DbdHost=slurm-master
#DbdPort=7031
SlurmUser=slurm
#MessageTimeout=300
DebugLevel=4
#DefaultQOS=normal,standby
LogFile=/var/log/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
# Database info
StorageType=accounting_storage/mysql
StorageHost=slurm-master
#StoragePort=3306
StoragePass=hrz!
StorageUser=slurm
StorageLoc=slurm_acct_db

 

Before doing anything else we need to create these directories:

mkdir /var/spool/slurm
chown -R slurm:slurm /var/spool/slurm
mkdir /var/log/slurm
chown -R slurm:slurm /var/log/slurm

How to start the services:

a. First check that the munge service is up on all nodes.
b. Check that MariaDB is up on the Master node.
c. Then slurmdbd needs to come up on the Master node (slurmctld depends on it).
d. Then slurmctld comes up on the Master node.
e. And at the end, slurmd on all compute nodes.
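On the Master node, that order can be expressed as a short loop (service names assumed to match the unit files installed earlier; slurmdbd comes before slurmctld because slurmctld depends on it). The sketch only echoes the commands so it can be dry-run; drop the leading echo to actually start the services:

```shell
# Master-node start order: munge -> mariadb -> slurmdbd -> slurmctld.
# (slurmd is then started on each compute node.)
for svc in munge mariadb slurmdbd slurmctld; do
    echo systemctl start "$svc"    # drop `echo` to execute
done
```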

 

Finally, if you want to know how Slurm was configured during installation, you can easily check with:

[root@slurm-master]# pwd

/opt/software/src/slurm/slurm-16.05.4

[root@slurm-master]# less config.log

 

***************************************************************************

 Testing Phase

Test 1: to submit jobs we need to switch to a regular user, since in our setup this does not work as root. So first, on the Master node:

su - hrz

Then I create a directory called test in the home directory. Note that users' home directories are mounted on all compute nodes, as I will explain in a separate thread.

[hrz@slurm-master]$ cat submit.sh

#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=res.txt
#
#SBATCH --ntasks=1
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100
srun hostname
srun sleep 60

And then simply run it with following command:

sbatch submit.sh

and then we can monitor the job with the squeue command.

[hrz@slurm-master]$ squeue

This job runs on only one of the compute nodes.
