We’ve been running a benchmark called HPL, also known as High Performance LINPACK, on our cluster. The cluster has 6 x86 nodes, each with an Intel(R) Xeon(R) CPU 5140 @ 2.33 GHz, 4 cores, and no accelerators. Our OS is CentOS 7.
At first, we had difficulty improving HPL performance across nodes: we got the same performance with 6 nodes as with 1 node. Here’s what we did to improve performance across nodes. Before we get into performance, though, let’s answer the big questions about HPL. For more information, visit the HPL FAQs.

What is HPL?

HPL measures the floating-point execution rate for solving a dense system of linear equations. Results are reported in FLOPS, which are floating-point operations per second. HPL depends on an MPI implementation and a BLAS library.


Theoretical peak performance

When you run HPL, the result reports the rate, in FLOPS, that HPL sustained while solving the problem. With benchmarks like HPL, there is also something called the theoretical peak FLOPS, which is computed as:

Number of cores * Average frequency * Operations per cycle

Your results will come in below the theoretical peak, but the peak is a good number to compare your HPL results against. First, we’ll look at the number of cores we have.

cat /proc/cpuinfo
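
The same information can be pulled out of the cpuinfo text programmatically. This is a minimal sketch, assuming an x86 Linux system whose /proc/cpuinfo exposes the processor, physical id, and core id fields; count_cpus is a hypothetical helper name, and the sample text is made up for illustration. On systems that omit those fields (some VMs and non-x86 machines), the physical count comes back as 0.

```python
from pathlib import Path


def count_cpus(cpuinfo_text):
    """Return (logical_processors, physical_cores) from /proc/cpuinfo text."""
    logical = 0
    cores = set()
    physical_id = core_id = None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "processor":
            # A new per-processor block starts here.
            logical += 1
            physical_id = core_id = None
        elif key == "physical id":
            physical_id = value
        elif key == "core id":
            core_id = value
        if physical_id is not None and core_id is not None:
            # Hyper-threads share the same (socket, core) pair.
            cores.add((physical_id, core_id))
    return logical, len(cores)


# Hypothetical two-core, four-thread excerpt, for illustration only.
SAMPLE = """\
processor : 0
physical id : 0
core id : 0

processor : 1
physical id : 0
core id : 1

processor : 2
physical id : 0
core id : 0

processor : 3
physical id : 0
core id : 1
"""
print(count_cpus(SAMPLE))  # (4, 2)

cpuinfo = Path("/proc/cpuinfo")
if cpuinfo.exists():
    print(count_cpus(cpuinfo.read_text()))
```

When the two numbers differ, the physical core count is the one to use in the peak formula.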

At the bottom of the cpuinfo of my laptop, I see processor: 7, which means that there are 8 logical processors, since processor numbers start with 0. With Hyper-Threading enabled, that is twice the number of physical cores: the cpu cores field shows 4, and physical cores are what count for the peak. From the model name, I see that I have an Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz, which means the base frequency is 2.60 GHz. For the operations per cycle, we need to dig deeper and find additional information about the architecture. Doing a Google search on Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz, we find that the max frequency considering turbo is 3.5 GHz. After a little snooping on the page, I noticed a link stating Products formerly Skylake. Skylake is the name of a microarchitecture. There’s a Stack Overflow question listing the operations per cycle for a number of recent processor microarchitectures. On the link, we see:

Intel Haswell/Broadwell/Skylake:

  • 16 DP FLOPs/cycle: two 4-wide FMA (fused multiply-add) instructions
  • 32 SP FLOPs/cycle: two 8-wide FMA (fused multiply-add) instructions

DP stands for double-precision, and SP stands for single-precision. HPL works in double precision, so using the max turbo frequency as an upper bound, this CPU would have a theoretical peak performance of:

4 cores * 3.50 GHz * 16 FLOPs/cycle = 224 GFLOPS
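
The formula is easy to wrap in a small function so you can plug in your own machine. This is a sketch only: peak_gflops is a hypothetical helper, and the 16-core, 2.0 GHz machine in the example is invented for illustration rather than taken from our cluster.

```python
def peak_gflops(cores, frequency_ghz, flops_per_cycle):
    """Theoretical peak in GFLOPS: cores * frequency (GHz) * FLOPs per cycle."""
    return cores * frequency_ghz * flops_per_cycle


# Hypothetical machine: 16 physical cores at 2.0 GHz, 16 DP FLOPs/cycle.
example_peak = peak_gflops(cores=16, frequency_ghz=2.0, flops_per_cycle=16)
print(f"{example_peak} GFLOPS")  # 512.0 GFLOPS
```

For a cluster, pass the total number of physical cores across all nodes.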

You will have to do the same calculation for your GPU if you plan on running HPL on it.


Why are your performance results below the theoretical peak?

The performance results depend on the algorithm, the size of the problem, the implementation, human optimizations to the program, the compiler’s optimizations, the age of the compiler, the OS, the interconnect, the memory, the architecture, and the hardware. Basically, things aren’t perfect when running HPL, so you won’t hit the theoretical peak, but the theoretical peak is a good number to measure your results against. Reaching at least 50% of your cluster’s theoretical peak performance with HPL would be an excellent goal.
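
To see where a run stands against that 50% goal, divide the measured rate by the theoretical peak. A minimal sketch, with hpl_efficiency as a hypothetical helper and made-up numbers rather than results from our cluster:

```python
def hpl_efficiency(measured_gflops, peak_gflops):
    """Fraction of the theoretical peak that an HPL run achieved."""
    if peak_gflops <= 0:
        raise ValueError("peak_gflops must be positive")
    return measured_gflops / peak_gflops


# Hypothetical run: 256 GFLOPS measured on a machine with a 512 GFLOPS peak.
efficiency = hpl_efficiency(measured_gflops=256.0, peak_gflops=512.0)
print(f"{efficiency:.0%} of peak")  # 50% of peak
```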


Improving HPL Performance across Nodes

Very helpful notes on tuning HPL are available here. The HPL.dat file resides next to the xhpl binary under hpl/bin. The file contains information on the problem size, machine configuration, and algorithm. In HPL.dat, you can change:

N – size of the problem. N should be close to the largest problem size that fits in memory. The HPL docs recommend filling up around 80% of total RAM. If the problem size is too large, the system starts swapping and the performance will drop. Think about how much RAM you have. For instance, let’s say that I had 4 nodes with 256 MB of RAM each. In total, I have 1 GB of RAM, which holds about 125 million double-precision (8-byte) elements, and the square root of that is about 11,585. Leaving room for the OS, an N of around 10,000 is likely to fit. On our cluster, performance peaked at N = 64000.
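
The sizing rule can be written down as a short calculation. This is a sketch under a few assumptions: the matrix holds N * N double-precision values of 8 bytes each, the 80% figure is the rule of thumb mentioned above, and the result is rounded down to a multiple of the block size (the NBs setting described further down). problem_size is a hypothetical helper name.

```python
import math


def problem_size(total_ram_bytes, fraction=0.8, block_size=256):
    """Largest N, rounded down to a multiple of block_size, whose
    N x N matrix of 8-byte doubles fits in fraction of total RAM."""
    elements = int(total_ram_bytes * fraction / 8)
    n = math.isqrt(elements)
    return (n // block_size) * block_size


# The example above: 4 nodes with 256 MB of RAM each.
total_ram = 4 * 256 * 2**20
n = problem_size(total_ram)
print(n)  # 10240
```

Treat the result as a starting point and try a few values around it, since the best N also depends on what else is using memory.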

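P – number of process rows in the P x Q process grid. One caveat is that P should be less than or equal to Q; the HPL docs recommend a grid that is roughly square, with Q slightly larger than P.

Q – number of process columns in the grid. (P * Q is the number of MPI processes HPL uses, so you must launch at least that many on your cluster.)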
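
One way to see which P and Q pairs are available for a given process count is to list its factor pairs, keeping P no larger than Q as noted above. A minimal sketch; process_grids is a hypothetical helper, not part of HPL.

```python
def process_grids(num_processes):
    """Return all (P, Q) pairs with P * Q == num_processes and P <= Q."""
    grids = []
    p = 1
    while p * p <= num_processes:
        if num_processes % p == 0:
            grids.append((p, num_processes // p))
        p += 1
    return grids


grids = process_grids(20)
print(grids)  # [(1, 20), (2, 10), (4, 5)]
```

The last pair in the list is the one closest to square.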

NBs – the block size. NB is the size of the blocks the matrix is split into, and it is used for data distribution and data reuse. Small block sizes will limit the performance because there is less data reuse in the highest level of the memory hierarchy and more messaging. Block sizes that are too big can waste space and add extra computation. The HPL docs recommend 32 – 256. We used 256.

Our example run: N = 64000

We used an N that is a multiple of NBs = 256. We settled on 256 because we noticed a huge performance drop when NBs < 256.

P = 4, which is the max number of cores we have on each node.

Q = 5, which is the number of nodes we use. We chose 5 because our 6th node didn’t have Intel libraries at the time.

NBs = 256.
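
Before launching, it can help to sanity-check the values. The sketch below uses the example run's numbers and assumes 8 bytes per matrix element (double precision); check_run is a hypothetical helper, and it only covers the matrix itself, not HPL's workspace or the OS.

```python
def check_run(n, nb, p, q):
    """Summarize basic facts about a set of HPL.dat values."""
    return {
        "n_multiple_of_nb": n % nb == 0,  # N divides evenly into blocks
        "processes": p * q,               # MPI processes the grid needs
        "matrix_gib": n * n * 8 / 2**30,  # memory for the N x N matrix
    }


info = check_run(n=64000, nb=256, p=4, q=5)
print(info)
```

Compare matrix_gib against the combined RAM of the nodes you plan to use.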

After editing HPL.dat and saving the file, you can run HPL with MPI. Our /nfs/hosts2 file contains the 5 IP addresses of the nodes we use. To run HPL with mpirun:

mpirun -n 20 -f /nfs/hosts2 ./xhpl
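
The process count passed to -n has to match the grid, so one option is to derive the command from P and Q instead of typing the number by hand. A small sketch; mpirun_command is a hypothetical helper, and the -f hostfile option is copied from the command above, so treat it as an assumption if your MPI launcher uses a different flag.

```python
import shlex


def mpirun_command(p, q, hostfile, binary="./xhpl"):
    """Build the mpirun argument list for a P x Q process grid."""
    return ["mpirun", "-n", str(p * q), "-f", hostfile, binary]


cmd = mpirun_command(p=4, q=5, hostfile="/nfs/hosts2")
print(shlex.join(cmd))  # mpirun -n 20 -f /nfs/hosts2 ./xhpl
```

The list form can be handed to subprocess.run directly if you want to script several runs.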

You should get higher FLOPS than when running HPL on a single node.