Graph500 on CUDA

I recently went to ISC16 for their Student Cluster Competition, and one of the challenges was to create our “own implementation of Graph500 to run on a cluster.”

If you don’t know about the Student Cluster Competition, it’s a competition where student teams work with vendors to build a cluster and optimize high performance scientific applications to run on real datasets under a 3000-watt power limit.

Graph500 is a ranking of supercomputer systems focused on data-intensive workloads. There are two main kernels.

The first kernel constructs an undirected graph from an edge list. The second kernel performs a breadth-first search (BFS) of the graph. Both kernels are timed.

There are a bunch of other nitty-gritty requirements outlined in their full specifications page.

The Graph500 reference code and implementations only contain sequential, OpenMP, XMT, and MPI versions.

The original developers have provided CPU implementations, but where’s the CUDA version? No one seems to want to open-source their optimized versions of Graph500!

 

Existing Open Source CUDA Graph500

While searching for Graph500 on CUDA, we found only one open source version, provided by the Suzumura Laboratory.

The Suzumura Laboratory has made a great contribution to the open source community on Graph500 with their papers, “Parallel Distributed Breadth First Search on GPU” and “Highly Scalable Graph Search for the Graph500 Benchmark,” written by Koji Ueno and Toyotaro Suzumura.

Their version was created in June 2012.

The first thing we wanted our Graph500 to be was open source.

The HPC Advisory Council states that “Other implementations of Graph500 exist and likely to improve performance, however not freely obtainable.”

graph500-implementations

 

Our version of Graph500 on CUDA with MPI

We made a much simpler implementation of Graph500 that you may want to check out to understand the Graph500 specification more easily. We created ours in June 2016.

https://github.com/buhpc/isc16-graph500

We’ll be updating this post at a later date to explain how we created our version of Graph500, but we hope that our source code will help you run Graph500 on your cluster with NVIDIA GPUs or create a version yourselves!

 

Testing our version of Graph500

We will test our version of Graph500 on a single NVIDIA Jetson TX1. Below are the NVIDIA Jetson TX1 specifications:

nvidia-jetson-tx1-specifications

The prerequisite for running our version of Graph500 is having CUDA. Our NVIDIA Jetson TX1 already had CUDA and OpenMPI installed when we set up Ubuntu 14.04. To check if you have CUDA set up, run:

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Thu_May__5_22:52:38_CDT_2016
Cuda compilation tools, release 7.0, V7.0.74

To check if you have MPI set up, run:

mpirun --version
mpirun (Open MPI) 1.6.5

Report bugs to http://www.open-mpi.org/community/help/

If you don’t have CUDA or MPI, you will need to Google how to install them for your operating system. Now, we have to install git to download our source code.

sudo apt-get install git -y

After you install git, you can git clone our repository.

git clone https://github.com/buhpc/isc16-graph500
cd isc16-graph500/

Make any changes to the Makefile needed to match the location of CUDA on your system.

vim Makefile

I have to make a small change to the LDFLAGS value. lib should be lib64 for me.

Original:

CC=mpicxx
FLAGS=-std=c++11
INCLUDE= -Iinclude -I/usr/local/cuda/include
LDFLAGS=-L/usr/local/cuda/lib
LIB=-lcudart
EXE=main
NVCC=nvcc
[email protected]

New:

CC=mpicxx
FLAGS=-std=c++11
INCLUDE= -Iinclude -I/usr/local/cuda/include
LDFLAGS=-L/usr/local/cuda/lib64
LIB=-lcudart
EXE=main
NVCC=nvcc
[email protected]

Now, run make to build everything:

make
CXX src/constructGraph.cpp
CXX src/graph.cpp
CXX src/edgeList.cpp
CXX src/init.cpp
CXX src/breadthFirstSearch.cpp
CXX src/main.cpp
CXX src/generateKey.cpp
CXX src/validation.cpp
NVCC src/buildAdjMatrix.cu
NVCC src/bfsStep.cu
CXX main

The Graph500 binary created is called main.

[USAGE] ./main <config.ini> <scale> <edgefactor>

N
the total number of vertices, 2^SCALE.

M
the number of edges. M = edgefactor * N.
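
As a quick sanity check, you can compute N and M for a given run directly in the shell (a small sketch; the variable names are just for illustration):

SCALE=6
EDGEFACTOR=1
N=$((1 << SCALE))       # number of vertices: 2^SCALE = 64
M=$((EDGEFACTOR * N))   # number of edges: edgefactor * N = 64
echo "N=$N M=$M"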

We will use main with mpirun and keep track of runtime with the time command. You may supply a hostfile. Check run.sh for a sample command to run the program.

In this example, we’ll use a SCALE of 6 and an edgefactor of 1, for a result of 64 vertices and 64 edges. But first, how many cores can we use?

grep -c ^processor /proc/cpuinfo
4

Now, we will run Graph500 on a single node with a very small graph.

time mpirun -n 4 ./main config.ini 6 1
Constructing graph...
Done.

Running 64 BFSs...
Got 24291.5 TEPS
Got 29580.9 TEPS
Got 60708.3 TEPS
Got 35433.1 TEPS
Got 62827.2 TEPS
Got 36697.2 TEPS
Got 54054.1 TEPS
Got 51428.6 TEPS
Got 53892.2 TEPS
Got 38461.5 TEPS
Got 53491.8 TEPS
Got 54298.6 TEPS
Got 42452.8 TEPS
Got 41142.9 TEPS
Got 54628.2 TEPS
Got 32727.3 TEPS
Got 54711.2 TEPS
Got 35928.1 TEPS
Got 62827.2 TEPS
Got 39955.6 TEPS
Got 63492.1 TEPS
Got 36108.3 TEPS
Got 58919.8 TEPS
Got 70588.2 TEPS
Got 42007 TEPS
Got 34515.8 TEPS
Got 54380.7 TEPS
Got 36659.9 TEPS
Got 58631.9 TEPS
Got 63492.1 TEPS
Got 53973 TEPS
Got 53491.8 TEPS
Got 69632.5 TEPS
Got 41237.1 TEPS
Got 60200.7 TEPS
Got 36363.6 TEPS
Got 62283.7 TEPS
Got 42402.8 TEPS
Got 48913 TEPS
Got 40000 TEPS
Got 69632.5 TEPS
Got 42755.3 TEPS
Got 61120.5 TEPS
Got 40678 TEPS
Got 48192.8 TEPS
Got 37228.5 TEPS
Got 55900.6 TEPS
Got 46272.5 TEPS
Got 3831.42 TEPS
Got 41618.5 TEPS
Got 60301.5 TEPS
Got 40540.5 TEPS
Got 70312.5 TEPS
Got 42553.2 TEPS
Got 48979.6 TEPS
Got 49792.5 TEPS
Got 53175.8 TEPS
Got 53254.4 TEPS
Got 63380.3 TEPS
Got 2178.65 TEPS
Got 59308.1 TEPS
Done.

real 0m0.752s
user 0m1.000s
sys 0m0.680s

We measure TEPS, traversed edges per second. Let m be the number of input edge tuples within the component traversed by the search, counting any multiple edges and self-loops. Let timeK2(n) be the measured execution time for kernel 2.

TEPS(n) = m / timeK2(n)
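
As a rough worked example with made-up numbers (not taken from the run above): if a search traverses m = 64 edges and kernel 2 takes 0.0021 seconds, the rate is about 30,000 TEPS.

awk 'BEGIN { m = 64; t = 0.0021; printf "%.1f TEPS\n", m / t }'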

Let us know if you have any questions. Feel free to fork our repository and improve our Graph500 code! Our output may not be exactly the same as the official Graph500 implementation, but it does fit the specifications as far as we know.

How to Set Up CUDA 7.0 on NVIDIA Jetson TX1 with JetPack – Detailed

The most recent version of NVIDIA JetPack is 2.2, which supports the NVIDIA Jetson TX1 and Jetson TK1. The big news is that JetPack 2.2 moves the userspace to 64-bit! In earlier versions of JetPack, the kernel was 64-bit, but the userspace was apparently still 32-bit, from what a source has told me.

Now with the userspace at 64-bit, you’ll have an easier time compiling and running arm64 libraries. Note that we’ll be flashing our NVIDIA Jetson TX1, so everything on it will be wiped. Remember to back up your files!

We made an earlier post last year on how to run CUDA 7.0 on NVIDIA Jetson TX1. In this post, we’ll outline very detailed instructions on setting up CUDA 7.0 for the NVIDIA Jetson TX1s from start to finish.

Requirements

  1. NVIDIA Jetson TX1, AC adapter, and WiFi antennas
  2. HDMI cable and monitor
  3. Computer with Ubuntu 14.04 or Laptop with VirtualBox
  4. Micro-B to USB Cable
  5. Keyboard

Step 1) We create an Ubuntu 14.04 x86 64-bit virtual machine with at least 15 GB of space to be safe.

I’m using VirtualBox to create the Ubuntu 14.04 x86 64-bit virtual machine. If you have an Ubuntu 14.04 x86 64-bit host operating system, you do not have to create the virtual machine. 15 GB of space will give you enough room for the JetPack download files. I set mine to 30 GB of space because I want to have other stuff on this VM for later.

ubuntu-virtual-machine-virtualbox

Step 2) On the virtual machine, download the latest JetPack installer. You will need to log in or create a new member account.

The JetPack installer can be found at the link below. We are using JetPack version 2.2.

https://developer.nvidia.com/embedded/jetpack

find-jetpack-download-file

 

log-in-or-create-nvidia-account

After logging in, hit the blue button and download JetPack.

hit-the-blue-button-to-download-jetpack

Step 3) You should have a file called JetPack-L4T-2.2-linux-x64.run. The name may differ slightly, but we want to run it.

Open up a new terminal and go to the directory where JetPack was downloaded.

cd ~/Downloads

We want to change the permissions of JetPack, so that we can run it in the terminal.

chmod 755 JetPack-L4T-2.2-linux-x64.run

Now, we can run the program.

sudo ./JetPack-L4T-2.2-linux-x64.run

Step 4) Downloading JetPack packages.

After running the above terminal command, a JetPack window should pop up.

the-first-next

Hit Next a couple of times.

Select Jetson TX1 Development Kit (64-bit) and hit Next.

Select Custom because we don’t need half of the JetPack stuff.

jetpack-custom-installation

We will set most of these packages to no action by clicking underneath the Action column.

set-most-to-no-action

The packages that we want are: CUDA Toolkit for Ubuntu 14.04, Linux for Tegra (TX1 64-Bit), Flash OS, CUDA Toolkit for L4T, and Compile CUDA Samples.

jetpack-what-you-need-to-download

You just don’t need most of the other packages if you only want CUDA on your NVIDIA Jetson TX1. Pick and choose any other extra packages if you want them.

Step 5) Hit Next to initiate the download and wait.

Hit Next and Accept All Terms and Conditions.

accept-all-terms-and-conditions-jetpack

A dialog will remind you that, depending on the component selection, you should pay attention to the prompts in the embedded terminal. Hit OK.

Sit back and relax because these download files are fairly big, so we’ll have to wait a while.

sit-back-jetpack-will-take-a-while

 

jetpack-download-speeds

The JetPack host installation will complete, and you can click Next to proceed.

jetpack-installation-complete

The prompt will ask you about Network Layout. I chose Device accesses Internet via router/switch.

Please select the network interface on host that connects to the same router/switch as:

I put wlan0 because I will be using the antennas to access the Internet through Wi-Fi. Our host computer will use the network connection to send files to our NVIDIA Jetson TX1. Hit Next.

Step 6) Post installation. We will have to put our NVIDIA Jetson TX1 into Force USB Recovery Mode.

jetpack-post-installation-steps

After hitting Next on this prompt, you will be brought to the Flash 64 Bit OS to TX1 device step.

jetpack-putting-nvidia-tegra-in-recovery-mode

The black terminal window says that we have to put the Jetson into Force USB Recovery Mode.

  1. Power down the Jetson.
  2. Connect the Micro-B to USB cable from the Jetson to your computer.
  3. Press the POWER button and let go; the Jetson powers up like normal. Press and hold the FORCE RECOVERY button. While holding it, press and release the RESET button. After two more seconds, let go of the FORCE RECOVERY button.

 

Make sure that your virtual machine detects the NVIDIA Corp USB device. Go to the Devices tab at the top of the virtual machine, go to USB, and select NVIDIA Corp. APX.

nvidia-corp-detected-on-vm

Back at the black terminal window, press Enter, and the OS flashing starts. Now, you just wait.

nvidia-jetson-jetpack-doing-its-business

When flashing completes, press Enter in the black terminal window.

post-installation-completed

Step 7) After flashing completes, connect an HDMI cable to your monitor. Your Jetson should have booted into Ubuntu 14.04. Connect to Wi-Fi on your Jetson.

Your virtual machine wants to run the CUDA installation instructions on your Jetson, but it can’t find the Jetson’s IP address!

time-to-connect-the-tegra-to-wifi

With your Jetson connected to a monitor, you will see that it has booted into Ubuntu 14.04. Now, we connect the Jetson to Wi-Fi.

IMG_20160709_134508

The password for the ubuntu user is: ubuntu.

I’m only using the keyboard to navigate. Press ALT + F1, press Enter, and search for “Network.” Use Tab to move around, open Wi-Fi, and connect to a Wi-Fi network.

IMG_20160709_134847

Now, we open a terminal by pressing CTRL + ALT + T. With the terminal open, we type:

ifconfig

IMG_20160709_135658

We see that the given IP address for our Jetson is: 192.168.1.114

Now back on our virtual machine on this screen, we hit 2 and press Enter.

time-to-connect-the-tegra-to-wifi

A JetPack window will pop up, and we can fill in the Device IP Address, User Name, and Password.

enter-device-ip

The User Name and Password are both ubuntu. Hit Next, and you will be brought to the post installation step for CUDA on the Jetson.

Step 8) Post installation for CUDA.

Hit Next on this screen.

cuda-post-installation

JetPack will copy the CUDA files onto the Jetson over the network. It will also run the CUDA installation commands on your Jetson.

post-installation-for-cuda

CUDA takes a really long time to copy and install, so you’ll be waiting a long while. After CUDA finishes, a JetPack window will pop up saying the installation is complete.

jetpack-finishes

Step 9) Making sure that CUDA is installed on the Jetson.

Back to the Jetson, open a new terminal with CTRL + ALT + T.

cd ~/cuda-l4t

You can use cuda-l4t.sh to install CUDA 7.0. In this folder, there is also the .deb file for CUDA 7.0.

sudo ./cuda-l4t.sh ./cuda-repo-l4t-7-0-local_7.0-76_arm64.deb 7.0 7-0

Hit Y and Enter on any prompt asking for permission. CUDA 7.0 should now be installed, but its binaries aren’t on your PATH yet. An entry has been automatically added to ~/.bashrc, but you still need to reload the ~/.bashrc.

source ~/.bashrc

Now, check if CUDA 7.0 is installed.

nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Thu_May__5_22:52:38_CDT_2016
Cuda compilation tools, release 7.0, V7.0.74

Step 10) Testing if CUDA 7.0 works on the Jetson.

JetPack has set up some CUDA samples that we can use to test.

cd ~/NVIDIA_CUDA-7.0_Samples/bin/aarch64/linux/release

We can run the Ocean Simulation sample. Cool!

./oceanFFT

IMG_20160709_143708

We can test the nbody sample to check our Jetson’s performance.

./nbody -benchmark -numbodies=65536

IMG_20160709_144011

I’m getting 264.744 single-precision GFLOP/s at 20 flops per interaction. In the past, we’ve gotten 318.763 single-precision GFLOP/s at 20 flops per interaction.

But, we certainly know that CUDA 7.0 is working on the NVIDIA Jetson TX1! Leave any questions below, and run more CUDA samples for fun.

IMG_20160709_144913
./smokeParticles

How to Compile HPL (LINPACK)

This guide will show you how to compile HPL (LINPACK) and provide some tips for selecting the best input values for HPL.dat based on my experiences at the student cluster competitions.

This benchmark stresses the computer’s floating-point capabilities.

Although raw FLOPS alone is not reflective of the applications typically run on supercomputers, floating-point performance is still important when precise calculations are required.

I assume that a version of MPI, C/C++/Fortran compilers, BLAS, and whatever other libraries you need are already installed.

There are many versions of LINPACK for different architectures, ranging from an Intel version to a CUDA version. The modifications for all versions are very similar. Below I have linked some of the different versions.

Compiling HPL

The first step is to make a copy of an existing makefile in the setup/ folder and place it in the root directory of HPL. I suggest Make.Linux_ATHLON_CBLAS, since that is the closest to a generic system. Call this file Make.[whatever]. For CUDA and Intel, Make.CUDA and Make.intel64 are already created for you.

Here, you may or may not need to modify TOPdir. Typically you should specify the full path to your HPL directory.

Next is specifying the location of your MPI files and binaries. MPdir should specify the exact path to the version of MPI you want to use, up to the root directory where include, lib, and bin are located.

MPinc should point to that installation’s include directory.

MPlib should point to the MPI library itself; the exact file name (libmpich.a in my case) depends on the MPI version you installed.

** Note: if you are using a *.so instead of a *.a, then you need to add the library path to your environment.

vim ~/.bashrc

Add the following to the end of the file:

export LD_LIBRARY_PATH=/path/to/mpi/lib:$LD_LIBRARY_PATH

Then, reload the file; from now on, the library path will be set every time you log on to your system.

source ~/.bashrc


Next is linking the BLAS libraries. LAdir should specify the exact location of your BLAS installation.

LAinc should specify the BLAS include directory, if you need one.

LAlib should specify the BLAS library file. If it is a *.so file, you can follow the steps above to add its path to your environment.


For HPL_OPTS, add -DHPL_DETAILED_TIMING for better analysis when tuning the HPL.dat file.


Lastly, you can specify your compiler (CC) and compiler flags (CCFLAGS).
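
Putting it together, the relevant lines of Make.[whatever] end up looking roughly like the sketch below. The MPICH and OpenBLAS paths are placeholders I made up for illustration; substitute the locations on your own system.

MPdir        = /usr/local/mpich
MPinc        = -I$(MPdir)/include
MPlib        = $(MPdir)/lib/libmpich.a
LAdir        = /usr/local/openblas
LAinc        =
LAlib        = $(LAdir)/lib/libopenblas.a
HPL_OPTS     = -DHPL_DETAILED_TIMING
CC           = mpicc
CCFLAGS      = $(HPL_DEFS) -O3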

Now to compile:

make arch=[whatever]

If you linked everything correctly, then in the bin/[whatever]/ directory, there should be an xhpl binary. Otherwise, you need to figure out which library was not linked properly.

Now navigate to bin/[whatever]/ to modify the HPL.dat file.

Modifying HPL.dat

The most important lines are:

  • Ns – the size of the matrix
  • NBs – the block size each process operates on at a time
  • Ps and Qs – the P x Q process grid you want to run the matrix on

Ns

N should typically be chosen so that the matrix fills around 80-90% of total memory.

N can be calculated by:
N = sqrt((memory size in GB * 1024^3 * number of nodes) / 8 bytes per double) * percentage

Larger Ns generally yield better results.
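
For example, with 4 GB of memory per node, a single node, and 85% memory usage (numbers chosen purely for illustration), N comes out to roughly 19,694:

awk 'BEGIN { mem = 4 * 1024^3; nodes = 1; pct = 0.85; print int(sqrt(mem * nodes / 8) * pct) }'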

NBs

NBs is typically between 32 and 256. A small NBs is good for a single node. An NBs in the low 100s or 200s is good for multiple nodes. A larger NBs, close to or above 1000, is good for GPUs and accelerators.

I normally select Ns to be a multiple of NBs, so there is no performance drop-off toward the end of the calculation (see the rounding example below).
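
A quick way to do that rounding (continuing the illustrative numbers from above) is to truncate N down to the nearest multiple of NB:

N=19694
NB=192
N=$(( (N / NB) * NB ))   # 19694 rounded down to a multiple of 192 is 19584
echo $N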

For the rest of the parameters, you can read about them in the HPL tuning documentation. You can find the CUDA tuning information in the CUDA HPL version, and Intel provides its own tuning documentation.

How to Install CUDA on NVIDIA Jetson TX1 [Deprecated]

Updated 2016 post – see the detailed version above. (The method shown in this guide is outdated.) This guide shows you how to install CUDA on the NVIDIA Jetson TX1. At the time of writing, NVIDIA’s JetPack installer did not work properly, so this blog post shows a workaround for getting CUDA to work on the TX1.

Download the following files into a directory first. You will need two files beforehand: the CUDA repository package (cuda-repo-l4t-r23.1-7-0-local_7.0-71_armhf.deb) and the JetPack installer (JetPack-L4T-2.0-linux-x64.run).

Updating your apt-get sources

Navigate to the directory where you downloaded the files and type in:

dpkg -i cuda-repo-l4t-r23.1-7-0-local_7.0-71_armhf.deb

Next, you want to update the sources by typing in:

apt-get update

Now, your apt-get repositories will have all of the CUDA libraries and files you may need for any future modifications.

Installing CUDA dependencies

Next, go to the directory where you downloaded the Jetpack installer and make the file executable by typing in:

chmod +x JetPack-L4T-2.0-linux-x64.run

Now, run the file:

./JetPack-L4T-2.0-linux-x64.run

The .run file should have unpacked its contents into a new directory called “_installer”.

Go into the _installer directory and type in:

./cuda-l4t.sh ../cuda-repo-l4t-r23.1-7-0-local_7.0-71_armhf.deb 7.0 7-0

**Note that ../cuda-repo-l4t-r23.1-7-0-local_7.0-71_armhf.deb is the location of the .deb file you downloaded earlier.

Now, you have every CUDA dependency installed. However, there are a few more things you have to do.

 

NVCC as a global call

NVCC is not linked globally (nvcc -V gives an error), and you need to do a few more things to fix this. First, let’s edit the .bashrc file.

vim .bashrc

The screenshot below shows what should be appended to the .bashrc file after the installation. I also put a copy of the exports below.

“:$PATH” should be after “export PATH=/usr/local/cuda-7.0/bin”
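
Based on that description, the appended lines look something like the sketch below. The first line follows directly from the text above; the LD_LIBRARY_PATH line is my assumption about what the installer adds, so verify the exact lib directory on your Jetson.

export PATH=/usr/local/cuda-7.0/bin:$PATH
# assumed lib directory; check what actually exists under /usr/local/cuda-7.0
export LD_LIBRARY_PATH=/usr/local/cuda-7.0/lib:$LD_LIBRARY_PATH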

Now, execute the .bashrc file.

source .bashrc

Now to make sure that everything is working, type in:

nvcc -V

 

Running Some CUDA Samples

Now, let’s run some CUDA samples and scale the GPU to max frequency.

Scaling GPU Frequency

You can find out your GPU’s possible clock rates by typing in:

cat /sys/kernel/debug/clock/gbus/possible_rates

Now, let’s set the GPU frequency to its maximum possible rate for some performance purposes.

echo 998400000 > /sys/kernel/debug/clock/override.gbus/rate
echo 1 > /sys/kernel/debug/clock/override.gbus/state
cat /sys/kernel/debug/clock/gbus/rate

You can lower the GPU frequency with the same steps above.
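
For example (the rate below is only a placeholder; use one of the values actually printed by possible_rates on your board):

echo 153600000 > /sys/kernel/debug/clock/override.gbus/rate
cat /sys/kernel/debug/clock/gbus/rate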

Running the nbody simulation and smoke particles

Navigate to the simulations (5_Simulations) directory containing both the nbody simulation and smoke particles samples:

cd /usr/local/cuda-7.0/samples/5_Simulations/

Now, navigate to the nbody directory and run make. Then, type in:

./nbody -benchmark -numbodies=65536

The results are approximately two times the performance of the previous-generation Jetson, the TK1, at 157 GFLOPS.

Navigate to the smokeparticles directory and run make. Then type in:

./smokeParticles