How to use MPI without NFS

You can use MPI without NFS or a shared file system! We had a situation where we couldn’t find the NFS server or client packages for arm64 for Ubuntu 16.04. We had OpenMPI version 1.10.2 installed on 2 nodes without NFS.

When you use MPI without NFS, you need to ensure that the same version of MPI is installed on every node.

Then, you have to ensure that the same data files (the program, the hostnames file, and any input files) are present on every node at the same path relative to that node.

Lastly, you should double-check that every node has the same SSH key and that you have SSHed onto every node at least once from the node where you run the program.

Step 1) Ensure that the same version of MPI is installed on every node.

We can check where OpenMPI, or whichever MPI implementation you use, is installed on every node.

which mpicc
 /opt/arm/openmpi-1.10.2_Cortex-A57_Ubuntu-14.04_aarch64-linux/bin/mpicc

Make sure that this directory is consistent on each node. Now, we should check that the ~/.bashrc adds the same OpenMPI bin directory to the PATH on every node.

vi ~/.bashrc

Add the following line somewhere in the file if it is not already there.

export PATH="/opt/arm/openmpi-1.10.2_Cortex-A57_Ubuntu-14.04_aarch64-linux/bin:$PATH"

If you just added the above line, you will need to reload your ~/.bashrc. Do this for every node.

source ~/.bashrc

We can double check the version of MPI on every node by running:

mpirun --version
mpirun (Open MPI) 1.10.2
Report bugs to http://www.open-mpi.org/community/help/

Step 2) Same data files on every node.

I suggest compiling the application or program on one node and then sending it to all the other nodes, because mpirun without NFS only works properly when the exact same program sits at the same location on every node.

For instance, on node 1, called tegra1-ubuntu, I will compile a basic MPI hello world program.

cd ~
pwd
/ubuntu/home

Now, we use git to download the mpi hello world program.

git clone https://github.com/huyle333/mpi-hello-world
cd mpi-hello-world

We will need to compile the program. My suggestion is to use the full path of mpicc to compile the program. We know the full path of mpicc already since we used which mpicc earlier.

 /opt/arm/openmpi-1.10.2_Cortex-A57_Ubuntu-14.04_aarch64-linux/bin/mpicc -o mpi_hello_world mpi_hello_world.c

We will send the compiled binary, mpi_hello_world, to the same location on all the other nodes.

scp mpi_hello_world [email protected]:~
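If you have several worker nodes, a small shell loop saves typing. The node names below are placeholders, and copying the whole project directory keeps the binary at ~/mpi-hello-world on every node, which is the path mpirun will use later:

for node in tegra2-ubuntu tegra3-ubuntu; do
    scp -r ~/mpi-hello-world "$node":~
done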

Now, we will have to create a file that contains the hostnames or IP addresses of the 2 nodes.

cd ~
vi hostnames

Instead of hostnames, you can put the IP addresses.

tegra1-ubuntu
tegra2-ubuntu
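With OpenMPI, you can also optionally tell mpirun how many processes each host should accept by adding slots to each line of the hostfile. A sketch, assuming 4 cores per node:

tegra1-ubuntu slots=4
tegra2-ubuntu slots=4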

We send the hostnames file to the other node.

scp hostnames [email protected]:~

Step 3) Make sure that the SSH key is the same on every node.

First, we check if we have an SSH key.

ls ~/.ssh
id_rsa id_rsa.pub known_hosts authorized_keys
cat ~/.ssh/id_rsa.pub
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQDJxIA4WSnXiJEWZ16SrRgGKOoIS6Z2sHSZreGKDggf+aJ2unEP5vtnFq07fmKDDxG+nMipTFpzx0bMB5ysXNZaTpnEKmW76BaO7402J/bIf/HsqZBMip39d+swkXkq9NB5yCHSn7+kmzf5PKaL34X8cNLOK6I5IZrqrHj8b10JyhORJ8URxa0VltItsblCvTUrdW5grR0+O8aY3UyzaZXLIwwYBF/vrQnt/bcPSA3j6lW829pUz+XsYOsKeit7aUep+ek0q1F3SYuPUoPe7vwp8+X+TiGBQTbraynZHVEov0ZJwWojw89Xc42qGtAiW1N+NrxkuaNXvJIHpua3ZCUdfJUXLlXfhOpFWZxU7F/C32Rj6x7kz6HJrjXkTaV3UD8puh7J2oVW8sGVOoKk99KPN0bztL//sj8UDVSD8rHxl5FanCHqBICIF+ZBrqcG6v3ElNcAq/KxpVEpypZndYa+FOwXvXJfBMg5IbDzgWXy6WAuK8bI8Iavk5UeRmAOGDvJzXG/30N06lmkQKnZYhtTQ4LY10Y0lbkNSCys7ceimRB3YKbVaoSxdbTiWzhNP2a7XTTmG/b1P022HdEYsZ9+9+iwyXRINmcvT3J+8QSsLryd3u/G5kWVX9iHnFPbEt3TRCZwJLkoQXxN0OTGFveaQpjMsui6Wpu3RKdcKMzY/w== [email protected]

We make sure that the contents of id_rsa.pub are somewhere inside the authorized_keys file. authorized_keys grants SSH access to anyone holding the private key that matches a public key listed in the file.
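If your public key is not in authorized_keys yet, you can append it like this (the chmod keeps the permissions that OpenSSH expects):

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys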

cat ~/.ssh/authorized_keys
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQDJxIA4WSnXiJEWZ16SrRgGKOoIS6Z2sHSZreGKDggf+aJ2unEP5vtnFq07fmKDDxG+nMipTFpzx0bMB5ysXNZaTpnEKmW76BaO7402J/bIf/HsqZBMip39d+swkXkq9NB5yCHSn7+kmzf5PKaL34X8cNLOK6I5IZrqrHj8b10JyhORJ8URxa0VltItsblCvTUrdW5grR0+O8aY3UyzaZXLIwwYBF/vrQnt/bcPSA3j6lW829pUz+XsYOsKeit7aUep+ek0q1F3SYuPUoPe7vwp8+X+TiGBQTbraynZHVEov0ZJwWojw89Xc42qGtAiW1N+NrxkuaNXvJIHpua3ZCUdfJUXLlXfhOpFWZxU7F/C32Rj6x7kz6HJrjXkTaV3UD8puh7J2oVW8sGVOoKk99KPN0bztL//sj8UDVSD8rHxl5FanCHqBICIF+ZBrqcG6v3ElNcAq/KxpVEpypZndYa+FOwXvXJfBMg5IbDzgWXy6WAuK8bI8Iavk5UeRmAOGDvJzXG/30N06lmkQKnZYhtTQ4LY10Y0lbkNSCys7ceimRB3YKbVaoSxdbTiWzhNP2a7XTTmG/b1P022HdEYsZ9+9+iwyXRINmcvT3J+8QSsLryd3u/G5kWVX9iHnFPbEt3TRCZwJLkoQXxN0OTGFveaQpjMsui6Wpu3RKdcKMzY/w== [email protected]

Make sure that the 2nd node has the ~/.ssh directory.

ssh [email protected]
ls ~/.ssh
known_hosts
exit

Back on node 1, we will send the SSH public and private key and authorized_keys file to node 2.

cd ~/.ssh
scp id_rsa id_rsa.pub authorized_keys [email protected]:~/.ssh
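OpenSSH is strict about the permissions on these files, so after the copy it is worth tightening them on node 2. A quick sketch, run on node 2:

chmod 700 ~/.ssh
chmod 600 ~/.ssh/id_rsa ~/.ssh/authorized_keys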

Step 4) Run the MPI program with full paths.

To make sure that the MPI program runs properly without NFS, run it with the full paths of the mpirun binary, the hostnames file, and the program itself.

[email protected]:~$ /opt/arm/openmpi-1.10.2_Cortex-A57_Ubuntu-14.04_aarch64-linux/bin/mpirun --hostfile /ubuntu/home/hostnames -n 8 /ubuntu/home/mpi-hello-world/mpi_hello_world
Hello world from processor tegra1-ubuntu, rank 2 out of 8 processors
Hello world from processor tegra1-ubuntu, rank 3 out of 8 processors
Hello world from processor tegra1-ubuntu, rank 1 out of 8 processors
Hello world from processor tegra1-ubuntu, rank 0 out of 8 processors
Hello world from processor tegra2-ubuntu, rank 7 out of 8 processors
Hello world from processor tegra2-ubuntu, rank 5 out of 8 processors
Hello world from processor tegra2-ubuntu, rank 6 out of 8 processors
Hello world from processor tegra2-ubuntu, rank 4 out of 8 processors

You should see that hello world runs on both nodes! If you have input files used by your program, make sure that they are also in the same location on both nodes.

Leave a comment if you have any questions.

How to Fix OpenMPI ORTE Error: unknown option “--hnp-topo-sig”

We encountered an ORTE bug with the error message, Error: unknown option "--hnp-topo-sig", while using OpenMPI version 1.10.2 for arm64 on Ubuntu 14.04 server. More specifically, we ran the following command using 2 nodes with MPI:

mpirun --hostfile /nfs/hostnames -n 4 /nfs/mpi-hello-world/mpi_hello_world

ORTE errors can happen for a variety of reasons, but this one usually means that the mpirun you invoke is not from the same installation as the MPI compiler you built the program with. Even if you think that you only have one MPI version, you may in fact have multiple versions of MPI.
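A quick way to check is to list every mpirun and mpicc on your PATH, not just the first one found; the -a flag of which does this:

which -a mpirun
which -a mpicc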

How we fixed our problem

First, we checked where our MPI C compiler was located.

which mpicc
/opt/arm/openmpi-1.10.2_Cortex-A57_Ubuntu-14.04_aarch64-linux/bin/mpicc

Nothing odd here. The location of the MPI C compiler was where we expected. But, we had to check if we actually had multiple versions of MPI.

mpicc (press tab twice)
mpicc mpicc.openmpi

We saw a second version of the MPI C compiler on our machine! If you have the ORTE error and you indeed have two versions of MPI, you should use full paths when using mpirun. Let’s see if it works.

Test a small MPI program on 2 nodes

First, we change directory into our NFS shared folder.

cd /nfs
git clone https://github.com/huyle333/mpi-hello-world
cd mpi-hello-world

We want to compile the mpi-hello-world program with the full path of the MPI compiler that we expect to use.

/opt/arm/openmpi-1.10.2_Cortex-A57_Ubuntu-14.04_aarch64-linux/bin/mpicc -o mpi_hello_world mpi_hello_world.c

mpi_hello_world is the created binary. Now, we test if we can use mpirun with 1 node. Use the full path of the mpirun command.

which mpirun
/opt/arm/openmpi-1.10.2_Cortex-A57_Ubuntu-14.04_aarch64-linux/bin/mpirun
 [email protected]:~/nfs/mpi-hello-world$ /opt/arm/openmpi-1.10.2_Cortex-A57_Ubuntu-14.04_aarch64-linux/bin/mpirun -n 4 /nfs/mpi-hello-world/mpi_hello_world
Hello world from processor tegra1-ubuntu, rank 0 out of 4 processors
Hello world from processor tegra1-ubuntu, rank 1 out of 4 processors
Hello world from processor tegra1-ubuntu, rank 2 out of 4 processors
Hello world from processor tegra1-ubuntu, rank 3 out of 4 processors

Using mpirun on 1 node seems to work fine. Okay, now let’s try 2 nodes. /nfs/hostnames contains 2 IP addresses of the nodes that I want to use.

[email protected]:~/nfs/mpi-hello-world$ /opt/arm/openmpi-1.10.2_Cortex-A57_Ubuntu-14.04_aarch64-linux/bin/mpirun --hostfile /nfs/hostnames -n 8 /nfs/mpi-hello-world/mpi_hello_world
Hello world from processor tegra1-ubuntu, rank 2 out of 8 processors
Hello world from processor tegra1-ubuntu, rank 3 out of 8 processors
Hello world from processor tegra1-ubuntu, rank 1 out of 8 processors
Hello world from processor tegra1-ubuntu, rank 0 out of 8 processors
Hello world from processor tegra2-ubuntu, rank 7 out of 8 processors
Hello world from processor tegra2-ubuntu, rank 5 out of 8 processors
Hello world from processor tegra2-ubuntu, rank 6 out of 8 processors
Hello world from processor tegra2-ubuntu, rank 4 out of 8 processors

Eureka! It works: hello world runs on both tegra1-ubuntu and tegra2-ubuntu.

Now for your actual program, use full paths to make sure that you are not running a different version of mpirun, and it should work.

How to Compile HPCG

HPCG, which stands for High Performance Conjugate Gradients, is a benchmark project to create a new metric for ranking HPC systems. HPCG measures the performance of basic operations including sparse matrix-vector multiplication, sparse triangular solve, vector updates, global dot products and more. The implementation is written in C++ with MPI and OpenMP support.

http://www.hpcg-benchmark.org/software/index.html

 

HPCG Reference Code

There are versions of HPCG optimized for NVIDIA GPUs or Intel XEON Phis. For this blog post, I’ll show you how to compile HPCG 3.0 Reference Code. Get the latest version with your desired optimizations.

ssh [email protected]
cd /nfs
wget http://www.hpcg-benchmark.org/downloads/hpcg-3.0.tar.gz
tar -xvf hpcg-3.0.tar.gz
cd hpcg-3.0/

The INSTALL file contains very useful instructions for compilation. First, we need to compose our Makefile.

cd setup/
ls

You will see a bunch of Makefiles whose extensions indicate which toolchain and libraries they target. I am assuming that you have an MPI library, so we’ll select Make.MPI_GCC_OMP, which stands for MPI, GCC, and OpenMP, as our base file. I’ll copy the file under a new name so that we have a fresh Makefile to adjust for our cluster.

cp Make.MPI_GCC_OMP Make.MPI_OPENMP
vim Make.MPI_OPENMP

Scroll down to the Message Passing library (MPI) section. We will need to set up the variables there so that HPCG can find our MPI installation. For my cluster:
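A sketch of that section, assuming the OpenMPI install used earlier in this series; point the values at your own MPI installation and library:

MPdir        = /opt/arm/openmpi-1.10.2_Cortex-A57_Ubuntu-14.04_aarch64-linux
MPinc        = -I$(MPdir)/include
MPlib        = $(MPdir)/lib/libmpi.so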

You can adjust other parameters to fit your needs, but for now, we’ll just make sure that hpcg can find MPI. Now, we should test our new Makefile.

cd ../
mkdir build_MPI_OPENMP
cd build_MPI_OPENMP/

We run the configure script with its full path and select the MPI_OPENMP extension.

/nfs/hpcg-3.0/configure MPI_OPENMP
make

make will create an executable called xhpcg inside the /nfs/hpcg-3.0/build_MPI_OPENMP/bin folder.

cd bin/

hpcg.dat contains the run parameters: the problem dimensions that you want xhpcg to use and the duration in seconds. For the HPCG Reference Code, we have found that a 16 16 16 problem size is representative of the best performance. You will need to run xhpcg for longer than 30 minutes for official results. Do not use exactly 1800 seconds; allow at least 2 extra minutes beyond the 30 minutes to account for system real time.
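For reference, hpcg.dat is a small four-line text file. Set up for a 16 16 16 problem and a 60-second test run, it looks roughly like this (the first two lines are free-form header text):

HPCG benchmark input file
Sandia National Laboratories; University of Tennessee, Knoxville
16 16 16
60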

But before you even run xhpcg for 30 minutes, test for 60 seconds first. We have found that a 60-second run is a decent indicator of the performance you will see when scaled to 30 minutes; we did not find any statistically significant increase or decrease from scaling the run time. You can also pass the parameters from hpcg.dat on the command line. Here’s how you run HPCG with MPI.

mpirun -n 24 -f /root/nfs/hosts ./xhpcg --nx=16 --rt=60

--nx is equal to the dimension of x. The dimensions must be divisible by 8.
--rt is equal to the number of seconds of the runtime.

If you do not pass any flags to ./xhpcg, the dimensions and duration will be taken from the values in hpcg.dat. After the 60-second benchmark finishes, you’ll find .yaml files. The .yaml file contains the results. To find the performance, scroll to the bottom of the .yaml file and see:

HPCG result is VALID with a GFLOP/s rating of: ...

The number of GFLOP/s is the indicator of your performance. Test for different dimension sizes, and once you have found a set of dimensions that you like, scale the timing to at least 32 minutes. I prefer running HPCG for an hour.
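A quick way to pull the rating line out of every result file in the directory is a simple grep, which just matches the line shown above:

grep "GFLOP/s rating" *.yaml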

nohup mpirun -n 24 -f /root/nfs/hosts ./xhpcg --nx=16 --rt=3600 &

Hit enter, and the process will run in the background until it completes or is killed. See how we set up our configurations for the HPCG Reference Code here:

https://github.com/BUILDS-/hpc/tree/master/hpcg

Let me know if you have any problems setting up HPCG for the first time.

 

Intel® MKL Benchmarks – includes optimized HPCG for Intel XEON and Phi

We do not have a local cluster that can run HPCG for Intel XEON and Phi. Here’s what we do know though. You can grab the link for optimized HPCG for Intel XEON and Phi at the HPCG software releases page.

This optimized version of HPCG is quite different, and you must have Intel XEON and/or Phi. First, let’s talk about the Make files. The extensions of the Makefiles have some familiar names like IMPI (Intel MPI), MPICH, OPENMPI, etc. But the final underscore extensions are different. What do they stand for?

If you run an AVX2 Makefile when your CPU doesn’t support AVX2, the resulting binary will fail with an illegal-instruction style error.

How to Setup the Intel Compilers on a Cluster

Intel compilers like icc or icl are very useful for any cluster with Intel processors. They’ve been known to produce very efficient numerical code. If you are still a student, you can grab the student Intel Parallel Studio XE Cluster Edition, which includes Fortran and C/C++ for free for a year. Here’s our experience. If you need more information, definitely check out the official Intel Parallel Studio XE Cluster Edition guide.

Dependencies

You should have the GCC C and C++ compilers on your machine. I am using CentOS 7. You will need to install GCC C and C++ compilers on all the machines.

yum install gcc
yum install gcc-c++

 

Getting the Intel compilers and MPI libraries

I’m going to grab the student Intel® Parallel Studio XE Cluster Edition for Linux, which lasts for a year. The first thing to do is to join the Intel Developer Zone at the following link:

https://software.intel.com/registration/?lang=en-us

Fill in your information and choose an Intel User ID to create. Now, you’ll have an account, but you’ll need to be a student to get the Intel compilers for free at:

https://software.intel.com/en-us/qualify-for-free-software/student

Click on Linux underneath Intel Parallel Studio XE Cluster Edition. Check the items on the next page and fill in your e-mail before submitting. After submitting, you’ll receive an e-mail labeled “Thank You for Your Interest in the Intel® Software Development Products.”

The e-mail contains a product serial number that should last a year. The e-mail also contains a DOWNLOAD button that you should click.

After visiting the link, you’ll be brought to Intel® Parallel Studio XE Cluster Edition for Linux*. I prefer the Full Offline Installer Package (3994 MB). If you choose the Full Offline Installer Package, you will need to stay on that link and acquire your license file. In the red text, you’ll see the following sentence:

"If you need to acquire your license file now, for offline installation, please click here to provide your host information and download your license file."

Once you click the here link, you’ll be brought to a Sign In page to download your license file. After signing in, you’ll see the licenses that you can download. Download your license file or e-mail it to yourself. If you download the license, it should be a .lic file.

At this point, you should have downloaded two files. parallel_studio_xe_2016_update2.tgz contains the zipped archive of the Intel Parallel Studio XE Cluster Edition, and NCOM….lic is your license.

parallel_studio_xe_2016_update2.tgz
NCOM....lic

You should upload these two files to the shared folder of your cluster. My shared folder is /nfs, so I’ll be sending those two files to my /nfs folder.

scp parallel_studio_xe_2016_update2.tgz NCOM...lic [email protected]:/nfs

Now, you can extract the tgz file by running:

ssh [email protected]
cd /nfs
tar -xvf parallel_studio_xe_2016_update2.tgz

We will put the license file as Licenses in /root.

mv NCOM....lic /root/Licenses

 

Activation

Now, we will set up the Intel compilers and MPI libraries.

cd parallel_studio_xe_2016_update2
./install.sh

It should say Initializing, please wait… until a text GUI pops up for installation. Type the number of the option that starts the installation.

First, we need to activate. Hit 3 and press Enter.

Step 2 of 7 | License agreement
[Press space to continue, 'q' to quit.]

After pressing space a bunch of times, you’ll reach the end of the license.

Type 'accept' to continue or 'decline' to go back to the previous menu:

Type “accept.”

Please type a selection or press "Enter" to accept default choice [1]:
Please type your serial number (the format is XXXX-XXXXXXXX):

In another terminal, check the serial number, which will be inside /root/Licenses.
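If you need to look it up, just open the license file in the other terminal:

less /root/Licenses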

 

Install

Hit Enter and choose the number options for the Intel compilers and libraries that you want to install. You’ll see the installation of the Intel MPI Benchmarks, libraries, C++ compiler, Fortran compiler, and more. Using the install.sh script is the surest way to make sure that all the Intel libraries are installed correctly, but if you really only want specific libraries, you’ll have to pick the ones you want from inside the rpm/ folder. The full installation may take 15 minutes or more.

Press "Enter" key to continue:
Press "Enter" key to quit:

As for the final step, the paths for Intel may not be set up automatically. I am using CentOS 7 64-bit, so I’ll have to set up the environment for Intel 64-bit. We’ll have to adjust our ~/.bashrc.

vim ~/.bashrc

Add to the end of the file the following:
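The exact lines depend on which version you installed and where; as a sketch, for a default install under /opt/intel, sourcing the compiler environment script is usually enough (the intel64 argument matches the 64-bit setup):

source /opt/intel/bin/compilervars.sh intel64
# If mpirun is still not found afterwards, also source the mpivars.sh that ships
# with Intel MPI, under the impi directory of your /opt/intel installation.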

Save and quit. Note: your directories may be slightly different based on the version of Intel Parallel Studio XE Cluster Edition you installed. Check what actually exists on your machine and adjust the directories accordingly.

source ~/.bashrc

Now, you should be able to access and use the Intel compilers as expected.

 

RLIMIT_MEMLOCK too small

When you first run your mpirun command with the Intel Parallel Studio XE Cluster Edition, you may receive an error about RLIMIT_MEMLOCK being too small.

mpirun -n 8 -f /nfs/hosts2 ./xhpcg --nx=16 --rt=60

The problem is that the memory lock limit is set statically, and it’s too small. On every machine where you want to use MPI, set the memory lock limit to unlimited.

ulimit -l unlimited
ulimit -l

If the second command says unlimited, we’ve set memory lock to unlimited. Now, we have to make sure that it’s unlimited on every startup instance.

vi /etc/security/limits.conf

Go to the bottom of the file and add the following:

*            hard   memlock           unlimited
*            soft   memlock           unlimited

Save and quit. These limits are applied at login, so log out and back in (or reopen your SSH session) first. Now, if you run the MPI command again, you should not encounter any problems.

 

Missing Hydra Files

You may come across an error about missing hydra files. When you run mpirun, you may get:

bash: /usr/local/bin/hydra_pmi_proxy: No such file or directory

Here is how I fixed the problem: I downloaded and compiled MPICH, a different MPI library. MPICH’s hydra binaries should work with Intel MPI because both use the same Hydra process manager. I copied MPICH’s hydra binaries to a directory that was also added to the ~/.bashrc PATH.

cp /nfs/mpich2/bin/hydra_persist /nfs/mpich2/bin/hydra_nameserver /nfs/mpich2/bin/hydra_pmi_proxy /usr/local/bin

Then, I added /usr/local/bin to the ~/.bashrc PATH.

vim ~/.bashrc

Add the following line:

export PATH=/usr/local/bin:$PATH

Save the file. And then reload ~/.bashrc.

source ~/.bashrc

Do this for all the nodes where you are missing hydra_pmi_proxy. Afterwards, if you run mpirun again, it should work!

Running MPI – Common MPI Troubleshooting Problems

In this post, I’ll list some common troubleshooting problems that I have experienced with MPI libraries after I compiled MPICH on my cluster for the first time. The following assumes that:

  1. You have at least 2 nodes as part of your cluster.
  2. You have MPI compiled inside an NFS (Network File System) shared folder.

I will divide the common problems into separate sections. For my first installation of an MPI library, I used MPICH from http://www.mpich.org/downloads/.

 

MPI Paths on each Node

I placed my MPICH library and binaries inside a shared folder, /nfs/mpich3. All the machines connected via a switch or directly through Ethernet or Infiniband should have the binary and library paths configured correctly to run MPI binaries like mpirun. To configure your MPI paths, edit the ~/.bashrc.

vim ~/.bashrc
export PATH=/nfs/mpich3/bin:$PATH
export LD_LIBRARY_PATH=/nfs/mpich3/lib:$LD_LIBRARY_PATH

Save and exit. Then, to load the new ~/.bashrc file:

source ~/.bashrc

 

How Do You Actually Run an MPI Program?

First, you should create a hosts file with all the IPs that you want to run MPI on. All of these machines should have the MPI path configured properly.

vim /nfs/hosts

Inside this hosts file, you will include all the IPs including the machine that you are on:
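The addresses below are placeholders for a 6-node cluster; use the real IPs of your own nodes:

10.0.0.1
10.0.0.2
10.0.0.3
10.0.0.4
10.0.0.5
10.0.0.6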

To find a machine’s IP address, use the following command:

ip addr show

Save and quit. Now, you’ll be ready to run your MPI command. The first thing to know about MPI is that the main binary is mpirun, and you specify how many cores you want to run the binary on. To determine how many cores are on each machine, run the following:

less /proc/cpuinfo

Scroll down and count how many processor entries you see. That count is the number of cores on the machine. To start, let’s say that I have a ./mpi_hello_world binary and 6 machines with 4 processors each.
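As a shortcut, grep can do the counting for you:

grep -c ^processor /proc/cpuinfo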

mpirun -f /nfs/hosts -n 24 ./mpi_hello_world

With the above command, I would have run the mpi hello world program across 24 cores on the IPs listed in the /nfs/hosts file. For sample MPI programs, use git to download this popular repository of samples:

git clone https://github.com/wesleykendall/mpitutorial

Check the tutorials folder inside mpitutorial, compile some of the programs, and try to use mpirun on the binaries!
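For example, to build and run the hello world sample (assuming you cloned the repository inside your shared /nfs folder; the path inside the repository may differ slightly between versions of the tutorial):

cd /nfs/mpitutorial/tutorials/mpi-hello-world/code
mpicc -o mpi_hello_world mpi_hello_world.c
mpirun -f /nfs/hosts -n 24 ./mpi_hello_world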

 

Firewall Blocking MPI

After you feel that you have configured MPI properly, you may encounter an error where MPI cannot communicate with other nodes. We could check to see if we have the appropriate ports open, but the easy way is to drop the firewall for quick testing.

On CentOS, for every connected machine, run the following:

systemctl stop firewalld
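If you want the firewall to stay off across reboots, and you are on a trusted private network, you can also disable the service:

systemctl disable firewalld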

Now, try your MPI command again!

 

Password Prompt

If the machines ask you for a password when you run an MPI command, then you haven’t set up your SSH keys properly. You should use SSH keys, with the key authorized on every node.

To generate an SSH key, first you should make sure that you have the ~/.ssh directory.

mkdir ~/.ssh

Check if you have an SSH key already.

ls ~/.ssh

If you see an id_rsa.pub or any other .pub file, that file is your SSH public key. If not, you can generate a standard SSH key with:

ssh-keygen -t rsa -b 4096 -C "[email protected]"

After making the SSH key, add the contents of ~/.ssh/id_rsa.pub to a new file called ~/.ssh/authorized_keys on every connected machine. If the file already exists, append the contents of your ~/.ssh/id_rsa.pub on a new line at the bottom. With your public key in every machine’s ~/.ssh/authorized_keys, you should be able to SSH into every connected machine without a password prompt.
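If the ssh-copy-id helper is installed, it appends your key and fixes the permissions for you; the user and address here are placeholders:

ssh-copy-id user@10.0.0.2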

 

Host Key Verification Failed

Still, you might get a host verification problem! Let’s say that I have a file with all my IPs of every connected machine, /nfs/hosts, to be used with mpirun. When I try the command:

mpirun -f /nfs/hosts -n 4 ./mpi_hello_world
Host key verification failed.
Host key verification failed.
Host key verification failed.

SSH keys are set up. The firewall is shut down. This problem happens because, from the machine where you run MPI, you must have SSHed at least once into every node listed in the /nfs/hosts file. SSH into each of those machines at least once so that the main machine has each host key recorded in ~/.ssh/known_hosts.

 

SSH to each node at least once

MPI might not work if you have not SSHed to each node at least once. Let’s say that my /nfs/hosts file contains 6 IP addresses. Let’s say that from node 1, I want to run:

mpirun -n 24 -f /nfs/hosts ./mpi_hello_world

Before I run the above command, I must SSH into nodes 2 through 6 from node 1 at least once to update ~/.ssh/known_hosts.
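A small loop makes this less tedious; it just runs hostname on each node, and you answer yes to each host key prompt the first time:

for ip in $(cat /nfs/hosts); do ssh "$ip" hostname; done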

 

Unable to get host address

You might get the following error:

[proxy:0:[email protected]] HYDU_sock_connect (../../utils/sock/sock.c:224): unable to get host address for buhpc1 (1)
[proxy:0:[email protected]] main (../../pm/pmiserv/pmip.c:415): unable to connect to server buhpc1 at port 46951 (check for firewalls!)

In this scenario, I am using buhpc1 and buhpc4 for MPI. But wait. We already shut off the firewall on both machines. Well, this problem happens when the nodes cannot resolve each other’s hostnames: a DNS configuration might be wrong, or you cannot reach any DNS server at all.

To fix the problem, you need to edit /etc/hosts on all the machines that you want to run MPI on.

vim /etc/hosts

Originally, the file only maps localhost. You will add a line for each node, mapping its IP address to its hostname (and its fully qualified name, if it has one). If you don’t have subdomains, a plain IP-to-hostname line per node is enough; a sketch of the finished file is shown below.
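Here is a sketch of a finished /etc/hosts. The first two lines are the stock CentOS localhost entries, and the IP addresses for the buhpc nodes are placeholders, so substitute your own:

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
10.0.0.1    buhpc1
10.0.0.4    buhpc4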

After saving /etc/hosts on both machines with these settings, you should be able to run MPI as expected.