In this post, I’ll list some common problems that I ran into with MPI libraries after compiling MPICH on my cluster for the first time. The following assumes that:

  1. You have at least 2 nodes as part of your cluster.
  2. You have MPI compiled inside an NFS (Network File System) share, a folder visible to every node.

I will divide the common problems into separate sections. For my first installation of an MPI library, I used MPICH from http://www.mpich.org/downloads/

MPI Paths on Each Node

I placed my MPICH library and binaries inside a shared folder, /nfs/mpich3. All the machines connected via a switch or directly through Ethernet or InfiniBand should have the binary and library paths configured correctly to run MPI binaries like mpirun. To configure your MPI paths, edit ~/.bashrc:

vim ~/.bashrc
export PATH=/nfs/mpich3/bin:$PATH
export LD_LIBRARY_PATH=/nfs/mpich3/lib:$LD_LIBRARY_PATH

Save and exit. Then, to load the new ~/.bashrc file:

source ~/.bashrc
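To confirm that the paths took effect, check where the shell now finds mpirun; it should print /nfs/mpich3/bin/mpirun:

which mpirun

Remember that every node needs the same ~/.bashrc changes, since each machine resolves its own paths.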


How Do You Actually Run an MPI Program?

First, create a hosts file listing the IPs of all the machines that you want to run MPI on. All of these machines should have the MPI paths configured properly.

vim /nfs/hosts

Inside this hosts file, list the IPs of every machine, one per line, including the machine that you are on. To find a machine’s IP address, use the following command:

ip addr show
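A finished hosts file for six machines might look like this (placeholder addresses; use your own):

192.168.0.1
192.168.0.2
192.168.0.3
192.168.0.4
192.168.0.5
192.168.0.6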

Save and quit. Now you’re ready to run your MPI command. The first thing to know is that the launcher binary is mpirun, and you tell it how many processes to start, typically one per core. To determine how many cores are on each machine, run the following:

less /proc/cpuinfo

Scroll down and count how many times you see a processor entry; that count is the number of logical cores on the machine.
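If you’d rather not count by hand, this one-liner prints the same number:

grep -c ^processor /proc/cpuinfo

To start, let’s say that I have a ./mpi_hello_world binary and 6 machines with 4 cores on each.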

mpirun -f /nfs/hosts -n 24 ./mpi_hello_world

With the above command, I would run the MPI hello world program across 24 processes on the IPs listed in the /nfs/hosts file. For sample MPI programs, use git to clone this popular tutorial repository:

git clone https://github.com/wesleykendall/mpitutorial

Check the tutorials folder inside mpitutorial, compile some of the programs, and try to use mpirun on the binaries!
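If you want something to compile right away, here is a minimal hello world of my own, similar in spirit to the repository’s samples (a sketch, not the repo’s exact code):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    /* Start the MPI runtime. */
    MPI_Init(&argc, &argv);

    /* Which process am I, and how many processes are there? */
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Name of the node this process landed on. */
    char name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(name, &name_len);

    printf("Hello from rank %d of %d on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}

Compile it with the wrapper compiler that MPICH installed, then launch it with mpirun as above:

mpicc mpi_hello_world.c -o mpi_hello_world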


Firewall Blocking MPI

After you feel that you have configured MPI properly, you may still encounter an error where MPI cannot communicate with the other nodes. We could check whether the appropriate ports are open, but the easy way is to drop the firewall for quick testing.

On CentOS, for every connected machine, run the following:

systemctl stop firewalld
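Note that stop only lasts until the next reboot; systemctl disable firewalld turns the firewall off permanently. On a production cluster you would instead open just the ports MPI needs, but stopping the firewall is the fastest way to rule it out.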

Now, try your MPI command again!


Password Prompt

If the machines ask you for a password when you run an MPI command, then you haven’t set up your SSH keys properly. You should use SSH keys, with your key authorized on every node.

To generate an SSH key, first make sure that the ~/.ssh directory exists (the -p flag keeps mkdir from failing if it already does):

mkdir -p ~/.ssh

Check if you have an SSH key already.

ls ~/.ssh

If you see id_rsa.pub or any other .pub file, that file is your public SSH key. If not, you can generate a standard SSH key with:

ssh-keygen -t rsa -b 4096 -C "your_email@example.com"

After making the SSH key, append the contents of ~/.ssh/id_rsa.pub to a new file called ~/.ssh/authorized_keys on every connected machine. If a machine already has this file, add the contents of your ~/.ssh/id_rsa.pub to the bottom of it on a new line. Once your public key is in each machine’s ~/.ssh/authorized_keys, you should be able to SSH into every connected machine without a password prompt.
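OpenSSH also ships a helper, ssh-copy-id, that appends your key to a remote machine’s authorized_keys for you. The user and address here are placeholders:

ssh-copy-id user@192.168.0.2

Run it once per node; you’ll type the password one last time for each.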


Host Key Verification Failed

Still, you might get a host key verification problem! Let’s say that I have a file, /nfs/hosts, listing the IPs of every connected machine, to be used with mpirun. When I try the command:

mpirun -f /nfs/hosts -n 4 ./mpi_hello_world
Host key verification failed.
Host key verification failed.
Host key verification failed.

SSH keys are set up. The firewall is down. This error happens because the machine from which you launch MPI must have SSHed into every node listed in /nfs/hosts at least once; each node’s host key has to be on record before mpirun can connect. SSH into each machine whose IP is in the hosts file at least once so that the launching machine has this history.

SSH to Each Node at Least Once

MPI might not work if you have not SSHed to each node at least once. Let’s say that my /nfs/hosts file contains 6 IP addresses and that, from node 1, I want to run:

mpirun -n 24 -f /nfs/hosts ./mpi_hello_world

Before I run the above command, I must SSH from node 1 into nodes 2 through 6 at least once to update ~/.ssh/known_hosts.
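If SSHing into each node by hand gets tedious, one way to prime known_hosts for every entry in the hosts file is a small loop like this sketch (it assumes one IP per line in /nfs/hosts; note that ssh-keyscan records keys without verifying them, so use it only on a network you trust):

while read host; do
    ssh-keyscan -H "$host" >> ~/.ssh/known_hosts
done < /nfs/hosts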


Unable to Get Host Address

You might get the following error:

[proxy:0:1@buhpc4] HYDU_sock_connect (../../utils/sock/sock.c:224): unable to get host address for buhpc1 (1)
[proxy:0:1@buhpc4] main (../../pm/pmiserv/pmip.c:415): unable to connect to server buhpc1 at port 46951 (check for firewalls!)

In this scenario, I am using buhpc1 and buhpc4 for MPI. But wait. We already shut off the firewall on both machines. This error actually happens when hostname resolution fails: DNS may be misconfigured, or you cannot reach any DNS server at all, so the nodes cannot resolve each other’s hostnames.

To fix the problem, you need to edit /etc/hosts on all the machines that you want to run MPI on.

vim /etc/hosts

Originally, it will look something like this (the stock file on CentOS):
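127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6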

Then, you will add a line for each node, mapping its IP address to its hostname. The addresses and domain below are placeholders; substitute the real IPs and names of buhpc1 and buhpc4:
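192.168.0.1 buhpc1.example.com buhpc1
192.168.0.4 buhpc4.example.com buhpc4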

If you don’t have subdomains, you can just leave the file like this (again, with your real IPs):
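192.168.0.1 buhpc1
192.168.0.4 buhpc4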

After saving /etc/hosts on both machines with these settings, you should be able to run MPI as expected.