I recently went to ISC16 for their Student Cluster Competition, and one of the challenges was to create our “own implementation of Graph500 to run on a cluster.”
If you don’t know about the Student Cluster Competition, it is a contest in which student teams work with vendors to build a cluster and optimize high-performance scientific applications to run on real datasets under a 3,000-watt power limit.
Graph500 is a ranking of supercomputer systems focused on data-intensive workloads. There are two main kernels.
The first kernel constructs an undirected graph, and the second kernel performs a breadth-first search of the graph. Both kernels are timed.
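To make the second kernel concrete, here is a minimal sequential BFS sketch in plain C++. It has nothing to do with our CUDA/MPI implementation; it is only meant to illustrate the kind of work the timed search performs.

#include <cstdio>
#include <queue>
#include <vector>

// Minimal sequential BFS over an adjacency list. It counts the adjacency
// entries it scans (each undirected edge is seen from both endpoints).
// This is only a sketch of what kernel 2 does; the real benchmark
// distributes the graph and runs the search in parallel.
long bfsScanEdges(const std::vector<std::vector<int>> &adj, int root) {
    std::vector<bool> visited(adj.size(), false);
    std::queue<int> frontier;
    visited[root] = true;
    frontier.push(root);
    long scanned = 0;
    while (!frontier.empty()) {
        int v = frontier.front();
        frontier.pop();
        for (int u : adj[v]) {
            ++scanned;
            if (!visited[u]) {
                visited[u] = true;
                frontier.push(u);
            }
        }
    }
    return scanned;
}

int main() {
    // Tiny undirected graph with edges 0-1, 1-2, 2-3, stored in both directions.
    std::vector<std::vector<int>> adj = {{1}, {0, 2}, {1, 3}, {2}};
    std::printf("adjacency entries scanned from root 0: %ld\n", bfsScanEdges(adj, 0));
    return 0;
}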
There are a bunch of other nitty-gritty requirements outlined on their full specifications page.
The Graph500 reference code and implementations only contain sequential, OpenMP, XMT, and MPI versions.
The original developers have provided CPU implementations, but where’s the CUDA version? No one seems to want to release their awesome versions of Graph500 as open source!
Existing Open Source CUDA Graph500
While searching for Graph500 on CUDA, we found only one open source version, provided by the Suzumura Laboratory.
The Suzumura Laboratory has made a great contribution to the open source community on Graph500 with their papers, “Parallel Distributed Breadth First Search on GPU” and “Highly Scalable Graph Search for the Graph500 Benchmark,” written by Koji Ueno and Toyotaro Suzumura.
Their version was created in June 2012.
The first thing we wanted our Graph500 to be was open source.
The HPC Advisory Council states that “Other implementations of Graph500 exist and likely to improve performance, however not freely obtainable.”
Our version of Graph500 on CUDA with MPI
We made a much simpler implementation of Graph500 that you may want to check out to understand the Graph500 specification more easily. We created ours in June 2016.
https://github.com/buhpc/isc16-graph500
We’ll update this post at a later date to explain how we created our version of Graph500, but we hope that our source code will help you run Graph500 on your cluster with NVIDIA GPUs or create a version yourself!
Testing our version of Graph500
We will test our version of Graph500 on a single NVIDIA Jetson TX1. Below are the NVIDIA Jetson TX1 specifications:
GPU: 256-core NVIDIA Maxwell
CPU: quad-core ARM Cortex-A57
Memory: 4 GB LPDDR4
The prerequisite for running our version of Graph500 is having CUDA. Our NVIDIA Jetson TX1 already had CUDA and Open MPI installed when we set up Ubuntu 14.04. To check if you have CUDA set up, run:
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Thu_May__5_22:52:38_CDT_2016
Cuda compilation tools, release 7.0, V7.0.74
To check if you have MPI setup, run:
mpirun --version
mpirun (Open MPI) 1.6.5
Report bugs to http://www.open-mpi.org/community/help/
If you don’t have CUDA or MPI, you will need to Google how to install them for your operating system. Now, we have to install git to download our source code.
sudo apt-get install git -y
After you install git, you can git clone our repository.
git clone https://github.com/buhpc/isc16-graph500
cd isc16-graph500/
Make any changes to the Makefile to fit your location of CUDA.
vim Makefile
I have to make a small change to the LDFLAGS value: lib should be lib64 for me.
Original:
CC=mpicxx
FLAGS=-std=c++11
INCLUDE= -Iinclude -I/usr/local/cuda/include
LDFLAGS=-L/usr/local/cuda/lib
LIB=-lcudart
EXE=main
NVCC=nvcc
Q=@
New:
CC=mpicxx
FLAGS=-std=c++11
INCLUDE= -Iinclude -I/usr/local/cuda/include
LDFLAGS=-L/usr/local/cuda/lib64
LIB=-lcudart
EXE=main
NVCC=nvcc
Q=@
make
CXX src/constructGraph.cpp
CXX src/graph.cpp
CXX src/edgeList.cpp
CXX src/init.cpp
CXX src/breadthFirstSearch.cpp
CXX src/main.cpp
CXX src/generateKey.cpp
CXX src/validation.cpp
NVCC src/buildAdjMatrix.cu
NVCC src/bfsStep.cu
CXX main
The Graph500 binary created is called main.
[USAGE] ./main <config.ini> <scale> <edgefactor>
N: the total number of vertices, 2^SCALE.
M: the number of edges, where M = edgefactor * N.
We will use main with mpirun and keep track of runtime with the time command. You may supply a hostfile, as shown below. Check run.sh for a sample command to run the program.
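For reference, an Open MPI hostfile is just a plain-text list of node names with the number of MPI slots each one provides. The hostnames below are hypothetical and run.sh may use different options; this is only a sketch of what a multi-node launch could look like. If the file is saved as hosts:

node01 slots=4
node02 slots=4

mpirun -n 8 --hostfile hosts ./main config.ini 6 1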
In this example, we’ll use a SCALE of 6 and an edgefactor of 1, for a result of 64 vertices and 64 edges. But first, how many cores can we use?
grep -c ^processor /proc/cpuinfo
4
Now, we will run Graph500 on a single node with a very small graph.
time mpirun -n 4 ./main config.ini 6 1
Constructing graph... Done.
Running 64 BFSs...
Got 24291.5 TEPS
Got 29580.9 TEPS
Got 60708.3 TEPS
Got 35433.1 TEPS
Got 62827.2 TEPS
Got 36697.2 TEPS
Got 54054.1 TEPS
Got 51428.6 TEPS
Got 53892.2 TEPS
Got 38461.5 TEPS
Got 53491.8 TEPS
Got 54298.6 TEPS
Got 42452.8 TEPS
Got 41142.9 TEPS
Got 54628.2 TEPS
Got 32727.3 TEPS
Got 54711.2 TEPS
Got 35928.1 TEPS
Got 62827.2 TEPS
Got 39955.6 TEPS
Got 63492.1 TEPS
Got 36108.3 TEPS
Got 58919.8 TEPS
Got 70588.2 TEPS
Got 42007 TEPS
Got 34515.8 TEPS
Got 54380.7 TEPS
Got 36659.9 TEPS
Got 58631.9 TEPS
Got 63492.1 TEPS
Got 53973 TEPS
Got 53491.8 TEPS
Got 69632.5 TEPS
Got 41237.1 TEPS
Got 60200.7 TEPS
Got 36363.6 TEPS
Got 62283.7 TEPS
Got 42402.8 TEPS
Got 48913 TEPS
Got 40000 TEPS
Got 69632.5 TEPS
Got 42755.3 TEPS
Got 61120.5 TEPS
Got 40678 TEPS
Got 48192.8 TEPS
Got 37228.5 TEPS
Got 55900.6 TEPS
Got 46272.5 TEPS
Got 3831.42 TEPS
Got 41618.5 TEPS
Got 60301.5 TEPS
Got 40540.5 TEPS
Got 70312.5 TEPS
Got 42553.2 TEPS
Got 48979.6 TEPS
Got 49792.5 TEPS
Got 53175.8 TEPS
Got 53254.4 TEPS
Got 63380.3 TEPS
Got 2178.65 TEPS
Got 59308.1 TEPS
Done.

real 0m0.752s
user 0m1.000s
sys 0m0.680s
We measure TEPS, traversed edges per second. Let m be the number of input edge tuples within the component traversed by the search, counting any multiple edges and self-loops, and let timeK2(n) be the measured execution time for kernel 2. Then:
TEPS(n) = m / timeK2(n)
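As a rough worked example (these numbers are made up, not taken from the run above): if a single BFS traverses m = 50 input edges and kernel 2 takes 1 millisecond, then TEPS = 50 / 0.001 = 50,000, which is the same order of magnitude as most of the values printed above.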
Let us know if you have any questions, and feel free to fork our repository and improve our Graph500 code! Our output may not exactly match the official Graph500 implementation’s, but it fits the specification as far as we know.