Setting Up InfiniBand in CentOS 6.7

Installing InfiniBand Drivers

In CentOS/RHEL, software support for Mellanox InfiniBand hardware is provided by the package group “Infiniband Support”, which can be installed with yum:

$yum -y groupinstall "Infiniband Support"

This installs the required kernel modules and the InfiniBand subnet manager, opensm.

Several optional packages are also available that make configuring and troubleshooting the network easier:

$yum -y install infiniband-diags perftest qperf

infiniband-diags is a network diagnostic package containing useful analysis tools such as ibping and ibstat.

perftest and qperf are performance-testing packages containing benchmarking tools.
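
For example, perftest’s ib_write_bw runs as a client/server pair once the fabric is up; start the server on one node and point a second node at it (the hostname below is just a placeholder for one of your own hosts):

$ib_write_bw                    # on the first node: wait for a connection
$ib_write_bw <server_hostname>  # on the second node: connect and run the bandwidth test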

InfiniBand in CentOS uses RDMA (Remote Direct Memory Access, the ability to access the memory of a remote host without involving its CPU). Set the rdma and opensm services to start at boot time:

$chkconfig rdma on
$chkconfig opensm on
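
To confirm that both services are now set to start at boot, you can list their runlevel settings (an optional sanity check):

$chkconfig --list | grep -E "rdma|opensm"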

Restart the system. This will load the required kernel modules and start rdma and opensm.

$reboot

Checking Network Connectivity

After the system reboots, check the status of rdma and opensm (both should be running):

$service rdma status
$service opensm status

Check that the InfiniBand interfaces are now recognized. They should appear as ib0, ib1, etc., with 20-byte hardware addresses:

$ifconfig

Test the status of the local InfiniBand link with ibstat:

$ibstat

Connected ports report a physical state of "LinkUp". Each port also has a Base LID (local identifier), which is the address other hosts on the fabric use to reach it.
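
For reference, ibstat output for a connected port looks roughly like the following; the adapter name, rate, and LID values here are illustrative, not from a real system:

CA 'mlx4_0'
        CA type: MT4099
        Number of ports: 1
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 40
                Base lid: 1
                LMC: 0
                SM lid: 1
                Link layer: InfiniBand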

To display the InfiniBand hosts and switches visible on the fabric, run the ibhosts and ibswitches commands:

$ibhosts
$ibswitches

To test network connectivity with ibping (the InfiniBand counterpart of the ICMP ping command), log in to the host you wish to ping and start an ibping server on it:

$ibping -S &

Run ibstat there to find the Base LID of the connected port. Then, on the host you wish to ping from, run:

$ibping <lid_dest>

You should see a stream of replies from the destination host with round-trip times and sequence numbers.
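
The replies look roughly like this (host name, LID, and timings are illustrative):

Pong from host2.(none) (Lid 2): time 0.057 ms
Pong from host2.(none) (Lid 2): time 0.053 ms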

Configuring the Network

Before making any changes to the network scripts, stop the network service:

$service network stop

CentOS 6.7 supports IP over InfiniBand (IPoIB), meaning that the InfiniBand interfaces can be configured with IP addresses just like Ethernet interfaces. The configuration files are in /etc/sysconfig/network-scripts and have ib0, ib1, etc. in the file name (for example, ifcfg-ib0).

By default, the script should have values similar to:

TYPE=infiniband
BOOTPROTO=dhcp
NAME=ib0
UUID=
NM_CONTROLLED=yes
DEVICE=ib0
ONBOOT=no

To configure a static IP address, change the ONBOOT value to “yes”, the BOOTPROTO value to “static”, and the NM_CONTROLLED value to “no”, then add the IP address, netmask, and gateway address. The modified script should look like:

TYPE=infiniband
BOOTPROTO=static
NAME=ib0
UUID=
NM_CONTROLLED=no
DEVICE=ib0
ONBOOT=yes
IPADDR=<address>
GATEWAY=<router address, usually first address of subnet>
NETMASK=<netmask>
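
For instance, a completed ifcfg-ib0 on a small private subnet might look like the following; all addresses here are made-up examples, so substitute your own:

TYPE=infiniband
BOOTPROTO=static
NAME=ib0
UUID=
NM_CONTROLLED=no
DEVICE=ib0
ONBOOT=yes
IPADDR=10.10.0.11
GATEWAY=10.10.0.1
NETMASK=255.255.255.0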

Save the changes to the script and start the network service:

$service network start

Run ifconfig again to make sure the InfiniBand interfaces now have the static IP addresses you assigned.

Setting Up Hostnames Without a DNS Server on CentOS 7

Let’s say that you have a cluster of 6 nodes and you do not have a DNS server.

How do you map the IPs to their respective hostnames so that every node in the cluster can resolve the other nodes’ hostnames? We need to bypass the automatic DNS lookup.

I have 6 nodes with the following IPs and hostnames.

128.197.115.158 buhpc1
128.197.115.7 buhpc2
128.197.115.176 buhpc3
128.197.115.17 buhpc4
128.197.115.9 buhpc5
128.197.115.15 buhpc6

Now, we edit every node’s /etc/hosts file.

For all nodes

vi /etc/hosts

On CentOS 7, the stock /etc/hosts contains only the IPv4 and IPv6 loopback entries. We keep those lines and append the six IP-to-hostname mappings listed above.
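
The resulting file should look roughly like this (the two loopback lines are the CentOS defaults; your copy may contain additional entries):

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

128.197.115.158 buhpc1
128.197.115.7 buhpc2
128.197.115.176 buhpc3
128.197.115.17 buhpc4
128.197.115.9 buhpc5
128.197.115.15 buhpc6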

Restart the network:

systemctl restart network

Make sure that the /etc/hosts file is exactly the same on all 6 nodes! Then, after restarting the network with the above command, the /etc/hosts file will bypass the automatic DNS lookup and map those hostnames to those IP addresses.

Testing

If I am on buhpc1, and I do:

ping buhpc2

I will be pinging 128.197.115.7. In the /etc/hosts file, we have defined 128.197.115.7 as buhpc2.
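
To double-check that the name is being resolved from /etc/hosts rather than DNS, you can also run:

getent hosts buhpc2
128.197.115.7   buhpc2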

How to Install Slurm on a CentOS 7 Cluster

Slurm is an open-source workload manager designed for Linux clusters of all sizes. It’s a great system for queuing jobs for your HPC applications. I’m going to show you how to install Slurm on a CentOS 7 cluster.

  1. Delete failed installation of Slurm
  2. Install MariaDB
  3. Create the global users
  4. Install Munge
  5. Install Slurm
  6. Use Slurm

Cluster Server and Compute Nodes

I configured our nodes with the following hostnames using the steps from the previous section. Our server is:

buhpc3

The clients are:

buhpc1
buhpc2
buhpc3
buhpc4
buhpc5
buhpc6

 

Delete failed installation of Slurm

I leave this optional step in case you tried to install Slurm, and it didn’t work. We want to uninstall the parts related to Slurm unless you’re using the dependencies for something else.

First, I remove the database where I kept Slurm’s accounting.

yum remove mariadb-server mariadb-devel -y

Next, I remove Slurm and Munge. Munge is an authentication tool used to identify messaging from the Slurm machines.

yum remove slurm munge munge-libs munge-devel -y

I check if the slurm and munge users exist.

cat /etc/passwd | grep slurm
cat /etc/passwd | grep munge

Then, I delete the users and corresponding folders.

userdel -r slurm
userdel -r munge
userdel: user munge is currently used by process 26278
kill 26278
userdel -r munge

Slurm, Munge, and MariaDB should now be completely removed, and we can start a fresh installation that actually works.

 

Install MariaDB

You can install MariaDB to store the accounting data that Slurm provides. If you want accounting, now is the time to install it. I only install this on the server node, buhpc3, which I also use as our SlurmDBD node.

yum install mariadb-server mariadb-devel -y

We’ll setup MariaDB later. We just need to install it before building the Slurm RPMs.

 

Create the global users

Slurm and Munge require consistent UID and GID across every node in the cluster.

For all the nodes, before you install Slurm or Munge:

export MUNGEUSER=991
groupadd -g $MUNGEUSER munge
useradd  -m -c "MUNGE Uid 'N' Gid Emporium" -d /var/lib/munge -u $MUNGEUSER -g munge  -s /sbin/nologin munge
export SLURMUSER=992
groupadd -g $SLURMUSER slurm
useradd  -m -c "SLURM workload manager" -d /var/lib/slurm -u $SLURMUSER -g slurm  -s /bin/bash slurm

 

Install Munge

Since I’m using CentOS 7, I need to get the latest EPEL repository.

yum install epel-release

Now, I can install Munge.

yum install munge munge-libs munge-devel -y

After installing Munge, I need to create a secret key on the server. My server is the node with the hostname buhpc3. Choose one of your nodes to be the server node.

First, we install rng-tools to properly create the key.

yum install rng-tools -y
rngd -r /dev/urandom

Now, we create the secret key. You only need to do this on the server node; either the create-munge-key command or the dd command below generates the key (the dd line simply overwrites it with 1024 bytes of fresh random data).

/usr/sbin/create-munge-key -r
dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
chown munge: /etc/munge/munge.key
chmod 400 /etc/munge/munge.key

After the secret key is created, you will need to send this key to all of the compute nodes.

scp /etc/munge/munge.key root@buhpc1:/etc/munge
scp /etc/munge/munge.key root@buhpc2:/etc/munge
scp /etc/munge/munge.key root@buhpc4:/etc/munge
scp /etc/munge/munge.key root@buhpc5:/etc/munge
scp /etc/munge/munge.key root@buhpc6:/etc/munge

Now, we SSH into every node and correct the permissions as well as start the Munge service.

chown -R munge: /etc/munge/ /var/log/munge/
chmod 0700 /etc/munge/ /var/log/munge/
systemctl enable munge
systemctl start munge

To test Munge, we can try to access another node with Munge from our server node, buhpc3.

munge -n
munge -n | unmunge
munge -n | ssh buhpc1 unmunge
remunge

If you encounter no errors, then Munge is working as expected.
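
In particular, the unmunge output should contain a success status line like this one:

STATUS:           Success (0)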

 

Install Slurm

Slurm has a few dependencies that we need to install before proceeding.

yum install openssl openssl-devel pam-devel numactl numactl-devel hwloc hwloc-devel lua lua-devel readline-devel rrdtool-devel ncurses-devel man2html libibmad libibumad -y

Now, we download Slurm, preferably into our shared folder. The latest version of Slurm may differ from the version used here.

cd /nfs
wget http://www.schedmd.com/download/latest/slurm-15.08.9.tar.bz2

If you don’t have rpmbuild yet, install it:

yum install rpm-build

Then build the Slurm RPMs from the tarball:

rpmbuild -ta slurm-15.08.9.tar.bz2

rpmbuild places the finished RPMs under /root/rpmbuild/RPMS/x86_64; change into that directory to check them:

cd /root/rpmbuild/RPMS/x86_64

Now, we copy the Slurm RPMs into the shared folder so that they can be installed on the server and compute nodes.

mkdir /nfs/slurm-rpms
cp slurm-15.08.9-1.el7.centos.x86_64.rpm slurm-devel-15.08.9-1.el7.centos.x86_64.rpm slurm-munge-15.08.9-1.el7.centos.x86_64.rpm slurm-perlapi-15.08.9-1.el7.centos.x86_64.rpm slurm-plugins-15.08.9-1.el7.centos.x86_64.rpm slurm-sjobexit-15.08.9-1.el7.centos.x86_64.rpm slurm-sjstat-15.08.9-1.el7.centos.x86_64.rpm slurm-torque-15.08.9-1.el7.centos.x86_64.rpm /nfs/slurm-rpms

On every node that will be a server or compute node, we install those RPMs. In our case, I want every node to be a compute node.

yum --nogpgcheck localinstall slurm-15.08.9-1.el7.centos.x86_64.rpm slurm-devel-15.08.9-1.el7.centos.x86_64.rpm slurm-munge-15.08.9-1.el7.centos.x86_64.rpm slurm-perlapi-15.08.9-1.el7.centos.x86_64.rpm slurm-plugins-15.08.9-1.el7.centos.x86_64.rpm slurm-sjobexit-15.08.9-1.el7.centos.x86_64.rpm slurm-sjstat-15.08.9-1.el7.centos.x86_64.rpm slurm-torque-15.08.9-1.el7.centos.x86_64.rpm

After we have installed Slurm on every machine, we will configure Slurm properly.

Visit http://slurm.schedmd.com/configurator.easy.html to make a configuration file for Slurm.

I leave everything default except:

ControlMachine: buhpc3
ControlAddr: 128.197.115.176
NodeName: buhpc[1-6]
CPUs: 4
StateSaveLocation: /var/spool/slurmctld
SlurmctldLogFile: /var/log/slurmctld.log
SlurmdLogFile: /var/log/slurmd.log
ClusterName: buhpc

After you hit Submit on the form, you will be given the full Slurm configuration file to copy.

On the server node, which is buhpc3:

cd /etc/slurm
vim slurm.conf

Copy the form’s Slurm configuration file that was created from the website and paste it into slurm.conf. We still need to change something in that file.

Underneath the “# COMPUTE NODES” section of slurm.conf, we see that Slurm tries to determine the IP addresses automatically with this one line:

NodeName=buhpc[1-6] CPUs=4 State=UNKNOWN

I don’t use IP addresses in order, so I manually delete this one line and list each node with its NodeAddr, using the addresses from /etc/hosts:

NodeName=buhpc1 NodeAddr=128.197.115.158 CPUs=4 State=UNKNOWN
NodeName=buhpc2 NodeAddr=128.197.115.7 CPUs=4 State=UNKNOWN
NodeName=buhpc3 NodeAddr=128.197.115.176 CPUs=4 State=UNKNOWN
NodeName=buhpc4 NodeAddr=128.197.115.17 CPUs=4 State=UNKNOWN
NodeName=buhpc5 NodeAddr=128.197.115.9 CPUs=4 State=UNKNOWN
NodeName=buhpc6 NodeAddr=128.197.115.15 CPUs=4 State=UNKNOWN

After you explicitly put in the NodeAddr IP addresses, you can save and quit.

Now that the server node has the slurm.conf correctly, we need to send this file to the other compute nodes.

scp slurm.conf root@buhpc1:/etc/slurm/slurm.conf
scp slurm.conf root@buhpc2:/etc/slurm/slurm.conf
scp slurm.conf root@buhpc4:/etc/slurm/slurm.conf
scp slurm.conf root@buhpc5:/etc/slurm/slurm.conf
scp slurm.conf root@buhpc6:/etc/slurm/slurm.conf

Now, we will configure the server node, buhpc3. We need to make sure that the server has all the right configurations and files.

mkdir /var/spool/slurmctld
chown slurm: /var/spool/slurmctld
chmod 755 /var/spool/slurmctld
touch /var/log/slurmctld.log
chown slurm: /var/log/slurmctld.log
touch /var/log/slurm_jobacct.log /var/log/slurm_jobcomp.log
chown slurm: /var/log/slurm_jobacct.log /var/log/slurm_jobcomp.log

Now, we will configure all the compute nodes, buhpc[1-6]. We need to make sure that all the compute nodes have the right configurations and files.

mkdir /var/spool/slurmd
chown slurm: /var/spool/slurmd
chmod 755 /var/spool/slurmd
touch /var/log/slurmd.log
chown slurm: /var/log/slurmd.log

Use the following command to make sure that slurmd is configured properly.

slurmd -C

You should get something like this:

ClusterName=(null) NodeName=buhpc3 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=7822 TmpDisk=45753
UpTime=13-14:27:52

The firewall will block connections between nodes, so I normally disable the firewall on the compute nodes except for buhpc3.

systemctl stop firewalld
systemctl disable firewalld

On the server node, buhpc3, I usually open the default ports that Slurm uses:

firewall-cmd --permanent --zone=public --add-port=6817/udp
firewall-cmd --permanent --zone=public --add-port=6817/tcp
firewall-cmd --permanent --zone=public --add-port=6818/tcp
firewall-cmd --permanent --zone=public --add-port=7321/tcp
firewall-cmd --reload
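
You can verify that the rules took effect after the reload:

firewall-cmd --zone=public --list-ports
6817/udp 6817/tcp 6818/tcp 7321/tcp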

If opening the ports does not work, stop firewalld for testing. Next, we need to check for out-of-sync clocks on the cluster. On every node:

yum install ntp -y
systemctl enable ntpd
ntpdate pool.ntp.org
systemctl start ntpd
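
To confirm that the clocks are actually syncing, query the NTP peers on each node; after a few minutes one server should be marked with an asterisk, meaning the node is synchronized against it:

ntpq -p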

The clocks should be synced, so we can try starting Slurm! On all the compute nodes, buhpc[1-6]:

systemctl enable slurmd.service
systemctl start slurmd.service
systemctl status slurmd.service

Now, on the server node, buhpc3:

systemctl enable slurmctld.service
systemctl start slurmctld.service
systemctl status slurmctld.service

When you check the status of slurmd and slurmctld, you should see whether they started successfully. If problems come up, check the logs!

Compute node bugs: tail /var/log/slurmd.log
Server node bugs: tail /var/log/slurmctld.log

Use Slurm

To display the compute nodes:

scontrol show nodes
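
For a more compact view of partitions and node states, sinfo is also useful; with the configuration above the output looks something like this:

sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      6   idle buhpc[1-6]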

The -N option chooses how many compute nodes to use. To run a job from the server node, buhpc3:

srun -N5 /bin/hostname
buhpc3
buhpc2
buhpc4
buhpc5
buhpc1

To display the job queue:

scontrol show jobs
JobId=16 JobName=hostname
UserId=root(0) GroupId=root(0)
Priority=4294901746 Nice=0 Account=(null) QOS=(null)
JobState=COMPLETED Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2016-04-10T16:26:04 EligibleTime=2016-04-10T16:26:04
StartTime=2016-04-10T16:26:04 EndTime=2016-04-10T16:26:04
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=debug AllocNode:Sid=buhpc3:1834
ReqNodeList=(null) ExcNodeList=(null)
NodeList=buhpc[1-5]
BatchHost=buhpc1
NumNodes=5 NumCPUs=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=20,node=5
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=0 Contiguous=0 Licenses=(null) Network=(null)
Command=/bin/hostname
WorkDir=/root
Power= SICP=0

To submit script jobs, create a script file that contains the commands that you want to run. Then:

sbatch -N2 script-file
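
For example, a minimal script (hypothetical file name test.sh) could look like this; the #SBATCH lines set the job name, node count, and output file:

#!/bin/bash
#SBATCH --job-name=hostname-test
#SBATCH --nodes=2
#SBATCH --output=hostname-test.out

srun /bin/hostname

Submit it with sbatch test.sh; options given on the command line, such as -N2 above, override the matching #SBATCH directives.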

Slurm has a lot of useful commands. You may have heard of other queuing tools like Torque. Here’s a useful link for the command differences: http://www.sdsc.edu/~hocks/FG/PBS.slurm.html

 

Accounting in Slurm

We’ll cover accounting in Slurm with MariaDB next time. Let me know if you encounter any problems with the above steps!

 

How to Bypass Intel PXE Boot

You might encounter a PXE Boot loading screen on your cluster even though you don’t use PXE Boot, and your network isn’t even configured yet. Many machines built for clusters pop up this PXE screen and loop forever. What do you do?

 

Change the Boot Order

Let’s say that all I want to do is boot the machine and install a new operating system. I have my trusty CentOS 7 bootable USB installer drive with me, and I plug it into the computer. First, restart the machine and press the key that enters Setup on the boot screen; it is normally F2, but on this machine it is DEL.

Move through the tabs until you reach the Boot Order tab.

As you look at the devices, you’ll see a couple of entries in your boot order called IBA GE Slot. These PCI slots trigger the PXE boot. We want to make sure that both of them sit at the bottom of the boot order. <+> moves devices up and <-> moves devices down.

Move USB HDD to the top and both PCI BEV: IBA GE Slot entries to the bottom. Make sure that USB HDD is associated with a number, like 1: USB HDD. If it does not have a number, your drive is in the excluded list; find its name and press <x> to include it in the boot list. Then, move USB HDD all the way up with <+>.

Press ESC, save the configuration, and the computer restarts. Wait while it boots into your bootable USB flash drive installer as expected. Thankfully, PXE will no longer try to boot before everything else!

yum [Errno 14] HTTP Error 404 – Not Found

When you run a yum command like:

yum install vim

You may get the following error:

Loaded plugins: fastestmirror
http://repo.dimenoc.com/pub/centos/7.1.1503/os/x86_64/repodata/repomd.xml: [Errno 14] HTTP Error 404 - Not Found
Trying other mirror.
Is this ok [y/d/N]: y
Downloading packages:
Delta RPMs disabled because /usr/bin/applydeltarpm not installed.
vim-enhanced-7.4.160-1.el7.x86 FAILED
http://ftp.linux.ncsu.edu/pub/CentOS/7.1.1503/os/x86_64/Packages/vim-enhanced-7.4.160-1.el7.x86_64.rpm: [Errno 14] HTTP Error 404 - Not Found

To fix the error:

yum clean all
yum update

Now try again:

yum install vim

The command should continue normally.