Skip to content

Graph500 on CUDA

I recently went to ISC16 for their Student Cluster Competition, and one of the challenges was to create our “own implementation of Graph500 to run on a cluster.”

If you don’t know about the Student Cluster Competition, they are cluster competitions where student teams work with vendors to create a cluster and optimize high performance scientific applications to be run under 3000 watts of power on real datasets.

Graph500 is a rating of supercomputer systems focused on data intensive loads. There are two main kernels.

The first kernel constructs an undirected graph. The second kernel performs breadth-first search of the graph. Both kernels are timed.

There are a bunch of other nitty-gritty requirements outlined in their full specifications page.

The Graph500 reference code and implementations only contain sequential, OpenMP, XMT, and MPI versions.

The original developers have provided CPU implementations, but where’s the CUDA version? No one wants to put their awesome versions of Graph500 open source!


Existing Open Source CUDA Graph500

While searching to find Graph500 on CUDA, we found only one open source version provided by the Suzumura Laboratory.

The Suzumura Laboratory has done a great contribution to the open source community on Graph500 with their papers, “Parallel Distributed Breadth First Search on GPU” and “Highly Scalable Graph Search for the Graph500 Benchmark” written by Koji Ueno and Toyotaro Suzumura.

Their version was created on June 2012.

First thing that we wanted our Graph500 to be was open source.

The HPC Advisory Council states that “Other implementations of Graph500 exist and likely to improve performance, however not freely obtainable.”



Our version of Graph500 on CUDA with MPI

We made a much simpler model for Graph500 that you may want to check for understanding the Graph500 specifications easier. We created ours on June 2016.

We’ll be updating this post to explain how we created our version of Graph500 on a later day, but we hope that our source code will help you run Graph500 on your cluster with NVIDIA GPUs or create a version yourselves!


Testing our version of Graph500

We will test our version of Graph500 on a single NVIDIA Jetson TX1. Below are the NVIDIA Jetson TX1 specifications:


The prequisite for running our version of Graph500 is having CUDA. Our NVIDIA Jetson TX1 already had CUDA and OpenMPI installed when we set up Ubuntu 14.04. To check if you have CUDA setup, run:

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Thu_May__5_22:52:38_CDT_2016
Cuda compilation tools, release 7.0, V7.0.74

To check if you have MPI setup, run:

mpirun --version
mpirun (Open MPI) 1.6.5

Report bugs to

If you don’t have CUDA or MPI, you will need to Google how to do it for your respective operating system. Now, we have to install git to download our source code.

sudo apt-get install git -y

After you install git, you can git clone our repository.

git clone
cd isc16-graph500/

Make any changes to the Makefile to fit your location of CUDA.

vim Makefile

I have to make a small change to the LDFLAGS value. lib should be lib64 for me.


INCLUDE= -Iinclude -I/usr/local/cuda/include
[email protected]


INCLUDE= -Iinclude -I/usr/local/cuda/include
[email protected]
CXX src/constructGraph.cpp
CXX src/graph.cpp
CXX src/edgeList.cpp
CXX src/init.cpp
CXX src/breadthFirstSearch.cpp
CXX src/main.cpp
CXX src/generateKey.cpp
CXX src/validation.cpp
NVCC src/
NVCC src/
CXX main

The Graph500 binary created is called main.

[USAGE] ./main <config.ini> <scale> <edgefactor>

the total number of vertices, 2^SCALE.

the number of edges. M = edgefactor * N.

We will use main with mpirun and keep track of runtime with the time command. You may supply a hostfile. Check for a sample command to run the program.

In this example, we’ll use a SCALE of 6 and edgefactor of 1 for a result of 64 vertices and 64 edges. But first, how many cores we can use?

grep -c ^processor /proc/cpuinfo

Now, we will run Graph500 on a single node with a very small graph.

time mpirun -n 4 ./main config.ini 6 1
Constructing graph...

Running 64 BFSs...
Got 24291.5 TEPS
Got 29580.9 TEPS
Got 60708.3 TEPS
Got 35433.1 TEPS
Got 62827.2 TEPS
Got 36697.2 TEPS
Got 54054.1 TEPS
Got 51428.6 TEPS
Got 53892.2 TEPS
Got 38461.5 TEPS
Got 53491.8 TEPS
Got 54298.6 TEPS
Got 42452.8 TEPS
Got 41142.9 TEPS
Got 54628.2 TEPS
Got 32727.3 TEPS
Got 54711.2 TEPS
Got 35928.1 TEPS
Got 62827.2 TEPS
Got 39955.6 TEPS
Got 63492.1 TEPS
Got 36108.3 TEPS
Got 58919.8 TEPS
Got 70588.2 TEPS
Got 42007 TEPS
Got 34515.8 TEPS
Got 54380.7 TEPS
Got 36659.9 TEPS
Got 58631.9 TEPS
Got 63492.1 TEPS
Got 53973 TEPS
Got 53491.8 TEPS
Got 69632.5 TEPS
Got 41237.1 TEPS
Got 60200.7 TEPS
Got 36363.6 TEPS
Got 62283.7 TEPS
Got 42402.8 TEPS
Got 48913 TEPS
Got 40000 TEPS
Got 69632.5 TEPS
Got 42755.3 TEPS
Got 61120.5 TEPS
Got 40678 TEPS
Got 48192.8 TEPS
Got 37228.5 TEPS
Got 55900.6 TEPS
Got 46272.5 TEPS
Got 3831.42 TEPS
Got 41618.5 TEPS
Got 60301.5 TEPS
Got 40540.5 TEPS
Got 70312.5 TEPS
Got 42553.2 TEPS
Got 48979.6 TEPS
Got 49792.5 TEPS
Got 53175.8 TEPS
Got 53254.4 TEPS
Got 63380.3 TEPS
Got 2178.65 TEPS
Got 59308.1 TEPS

real 0m0.752s
user 0m1.000s
sys 0m0.680s

We measure TEPS, traversed edges per second. Let m be the number of input edge tuples within the component traversed by the search, counting any multiple edges and self-loops. Let timeK2(n) be the measured execution time for kernel 2.

TEPS(n) = m / timeK2(n)

Let us know if you have any questions. Feel free to fork our repository and improve our Graph500 code! Our output may not be exactly the same as the official Graph500 implementation, but it does fit the specifications as far as we know.