We’ve been working on a benchmark called HPL, also known as High Performance LINPACK, on our cluster. Our cluster is made of 6 x86 nodes, each with an Intel(R) Xeon(R) CPU 5140 @ 2.33 GHz, 4 cores, and no accelerators, running CentOS 7. At first, we had difficulty improving HPL performance across nodes: for some reason, we would get the same performance with 1 node as with 6 nodes. Here’s what we did to improve performance across nodes, but before we get into performance, let’s answer the big questions about HPL. For more information, visit the HPL FAQs.
What is HPL?
HPL measures the floating-point execution rate for solving a system of linear equations. Results are reported in FLOPS, i.e. floating-point operations per second. The dependencies include MPI and BLAS.
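Before building or running anything, it’s worth a quick sanity check that both dependencies are visible on every node. Here is a minimal sketch (the exact wrapper and library names vary by MPI and BLAS installation; we used the Intel libraries):
which mpirun mpicc                      # MPI launcher and compiler wrapper on the PATH?
ldconfig -p | grep -i -E 'blas|mkl'     # is a BLAS implementation (e.g. MKL) registered with the loader?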
Theoretical peak performance
When you run HPL, the result reports how fast the run executed floating-point operations. With benchmarks like HPL, there is something called the theoretical peak FLOPS, which is given by:
Number of cores * Average frequency * Operations per cycle
Your measured result will come in below the theoretical peak FLOPS, but the theoretical peak is a good number to compare your HPL results against. First, we’ll look at the number of cores we have.
cat /proc/cpuinfo
processor       : 7
vendor_id       : GenuineIntel
cpu family      : 6
model           : 94
model name      : Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
stepping        : 3
cpu MHz         : 2592.000
cache size      : 256 KB
physical id     : 0
siblings        : 8
core id         : 3
cpu cores       : 4
apicid          : 7
initial apicid  : 7
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pni dtes64 monitor ds_cpl vmx est tm2 ssse3 fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt aes xsave osxsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:
At the bottom of the cpuinfo output on my laptop, I see processor : 7, which means there are 8 logical processors (processor numbering starts at 0). Looking further, siblings : 8 and cpu cores : 4 tell me this is a 4-core CPU with Hyper-Threading, and it is the physical core count that matters for the theoretical peak. From the model name, I see that I have an Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz, which means the base frequency is 2.60 GHz. For the operations per cycle, we need to dig deeper and search for additional information about the architecture. Doing a Google search on Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz, we find that the max frequency considering turbo is 3.5 GHz. After a little snooping on the page, I noticed a link stating Products formerly Skylake.
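If you’d rather not scan /proc/cpuinfo by eye, a couple of standard commands report the same counts (a quick sketch; lscpu label wording can vary slightly between distributions):
grep -c ^processor /proc/cpuinfo         # logical processors (8 on this laptop)
lscpu | grep -E '^(Socket|Core|Thread)'  # sockets, cores per socket, threads per core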
Skylake is the name of a microarchitecture. There’s a Stack Overflow question listing the operations per cycle for a number of recent processor microarchitectures. On that page, we see:
Intel Haswell/Broadwell/Skylake:
- 16 DP FLOPs/cycle: two 4-wide FMA (fused multiply-add) instructions
- 32 SP FLOPs/cycle: two 8-wide FMA (fused multiply-add) instructions
DP stands for double-precision, and SP stands for single-precision; HPL solves the system in double precision, so the DP figure is the one we want. Using the 4 physical cores and the 3.5 GHz turbo frequency, this laptop CPU has a theoretical peak performance of:
4 cores * 3.50 GHz * 16 FLOPs/cycle = 224 GFLOPS
Keep in mind that the turbo frequency is optimistic, since all cores rarely sustain it at once; using the 2.6 GHz base frequency instead gives a more conservative 166.4 GFLOPS.
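For comparison, here is the same back-of-the-envelope calculation for our cluster nodes. This is only a sketch under the assumption that the Xeon 5140 (the older Core/Woodcrest microarchitecture, which has SSE but no AVX or FMA) retires 4 DP FLOPs/cycle:
4 cores * 2.33 GHz * 4 FLOPs/cycle ≈ 37.3 GFLOPS per node
6 nodes * 37.3 GFLOPS ≈ 224 GFLOPS for the whole cluster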
You will have to do the same kind of calculation for your GPUs if you plan on running HPL with accelerators.
Why are your performance results below the theoretical peak?
The performance results depend on the algorithm, the size of the problem, the implementation, hand-tuned optimizations to the program, the compiler and its optimizations (and its age), the OS, the interconnect, the memory, and the hardware architecture. Basically, things aren’t perfect when running HPL, so you won’t hit the theoretical peak, but the theoretical peak is a good number to measure your results against. Reaching at least 50% of your cluster’s theoretical peak performance with HPL would be an excellent goal.
Improving HPL Performance across Nodes
Very helpful notes on tuning HPL are available here. The HPL.dat file resides in the same directory as the xhpl binary (under hpl/bin/ in our install). The file contains information on the problem size, machine configuration, and algorithm. In HPL.dat, you can change:
N – the size of the problem. Choose the largest problem size that fits in memory: the HPL docs recommend filling around 80% of total RAM. If the problem size is too large, the system starts swapping and performance drops. Think about how much RAM you have. For instance, say I had 4 nodes with 256 MB of RAM each, so 1 GB of RAM in total. One GB holds about 125 million 8-byte double-precision numbers, and using 80% of that memory for the matrix gives N ≈ sqrt(0.8 * 125,000,000) = 10000 (see the sketch after this list). On our cluster, peak performance was at N = 64000.
P – the first dimension of the process grid; in our setup, the number of processes per node (one per core). One caveat is that P should be less than or equal to Q.
Q – the second dimension of the process grid; in our setup, the number of nodes. (P * Q is the total number of MPI processes you run on your cluster.)
NBs – the block size NB, used to partition N for data distribution across nodes and for data reuse. Small block sizes limit performance because there is less data reuse in the highest level of the memory hierarchy and more messaging; block sizes that are too big waste space and add extra computation. The HPL docs recommend values in the 32–256 range. We used 256.
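Referring back to the N example above, here is a rough sketch of that calculation as a shell snippet. The memory figure is the 1 GB example from the list, not our cluster’s actual RAM; substitute your own total and block size:
# Pick N from total RAM across the cluster, using ~80% of memory,
# rounded down to a multiple of the block size NB.
MEM_BYTES=$((4 * 256 * 1024 * 1024))   # example: 4 nodes with 256 MB each
NB=256
awk -v mem=$MEM_BYTES -v nb=$NB 'BEGIN { n = sqrt(0.80 * mem / 8); print int(n / nb) * nb }'   # prints 10240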
Our example run: N = 64000
We used an N that is a multiple of our block size. We chose NB = 256 because we noticed a huge performance drop when NB < 256.
P = 4, which is the max number of cores we have on each node.
Q = 5, which is the number of nodes we use. We chose 5 because our 6th node didn’t have Intel libraries at the time.
NBs = 256.
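For reference, this is roughly how the corresponding lines of our HPL.dat looked. Only the lines for N, NB, P, and Q are shown here; the remaining tuning parameters in the file were left at their defaults, and the trailing comments come from the stock file:
1            # of problems sizes (N)
64000        Ns
1            # of NBs
256          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
4            Ps
5            Qs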
After editing and saving HPL.dat, you can run HPL across the nodes with MPI. Our /nfs/hosts2 host file contains the IP addresses of the 5 nodes, one per line. To run HPL with mpirun:
mpirun -n 20 -f /nfs/hosts2 ./xhpl
You should get improved FLOPS compared to running HPL on a single node.
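When the run finishes, the number to check is the Gflops column of the result line in HPL’s output. The exact T/V code depends on your algorithm settings, but the line looks roughly like this (timings and rates elided):
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       64000   256     4     5                ...                    ...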