RAPIDS experiment round 2: PageRank algorithm with cuGraph
source: https://rapids.ai/start.html


This is round 2 of my experiments with RAPIDS.

For those of you who are not familiar with how to set up the environment, please read the round 1 post linked below.

For those of you who already have an environment to experiment with RAPIDS, this short post might surprise you...

Today, we turn our focus to an old algorithm, PageRank. If you are not familiar with it, below are two URLs for a quick recap :)

Alright, let's run through the PageRank algorithm executed by networkx (CPU) vs cugraph (GPU) on a tiny (8 records) vs a larger (1000 records) dataset.

############# PageRank with Tiny dataset (only 8 records) #############

Now, if we use the tiny dataset (only 8 records),

well, the cugraph performance (0.0041 sec) is embarrassingly slower than the networkx performance (0.0011 sec).

In [1]:

import networkx as nx
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


# get the data
df = pd.read_csv('./data/pageRankData.csv', index_col=False, sep=';')
# re-index: use the 'steps' column as the index
df2 = df.set_index(['steps'])
n = len(df2)
print("number of rows in csv data is ", n)
# map each index value (step) to an integer id
d = dict((idx, i) for i, idx in zip(range(n), list(df2.index)))

Out[1]:

number of rows in csv data is  8
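The graphs used below, G for networkx and G_gpu for cugraph, are built from this edge list in a cell that is not shown in the post. Here is a minimal sketch of how they might be constructed; the column names 'src' and 'dst' are assumptions, since the schema of pageRankData.csv is not shown here.

# (sketch, not from the original post) build the two graphs used below
# 'src' and 'dst' are assumed column names in pageRankData.csv
import cudf
import cugraph

# CPU graph for networkx, built straight from the pandas edge list
G = nx.from_pandas_edgelist(df, source='src', target='dst')

# GPU graph for cugraph: move the edge list into a cuDF DataFrame first
gdf = cudf.from_pandas(df[['src', 'dst']])
G_gpu = cugraph.Graph()
G_gpu.from_cudf_edgelist(gdf, source='src', destination='dst')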

Call the PageRank algorithm from networkx

In [25]:

import time

# define the PageRank parameters
max_iter = 100   # the maximum number of iterations
tol = 0.00001    # convergence tolerance
alpha = 0.85     # damping factor

start = time.time()
pr_nx = nx.pagerank(G, alpha=alpha, max_iter=max_iter, tol=tol)
print("----- process using networkx pagerank algorithm took  %s -----  " % (time.time() - start))

Out[25]: the networkx run took ~0.0011 seconds

----- process using networkx pagerank algorithm took  0.0011019706726074219 -----  


Call the PageRank algorithm from cugraph

In [27]:

import cugraph

# call cugraph.pagerank to get the PageRank scores on the GPU graph
start = time.time()
gdf_page = cugraph.pagerank(G_gpu, alpha=alpha, max_iter=max_iter, tol=tol)
print("----- process using cugraph pagerank algorithm took  %s -----  " % (time.time() - start))

Out[27]: the cugraph run took ~0.0041 seconds

----- process using cugraph pagerank algorithm took  0.0041408538818359375 -----  
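As a quick sanity check (my addition, not in the original post): cugraph.pagerank returns a cuDF DataFrame with one row per vertex, so you can peek at the highest-ranked vertices directly on the GPU. The 'pagerank' column name below matches recent cuGraph releases and is an assumption for the version used here.

# inspect the GPU result: one row per vertex with its PageRank score
print(gdf_page.sort_values('pagerank', ascending=False).head())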

########### PageRank algorithm with Larger dataset (1000 records) ###########

First, we need to write a function that generates 1000 samples of non-repeating tuples of nodes connected to one another; a minimal sketch of such a generator is shown below.
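This sketch is my own reconstruction (the original helper is only shown as an image in the post); the node count of 200 and the column names 'src'/'dst' are arbitrary assumptions.

import random
import pandas as pd

def generate_edges(n_edges=1000, n_nodes=200, seed=42):
    """Return n_edges unique (src, dst) pairs with no self-loops."""
    random.seed(seed)
    edges = set()
    while len(edges) < n_edges:
        src, dst = random.randrange(n_nodes), random.randrange(n_nodes)
        if src != dst:
            edges.add((src, dst))
    return pd.DataFrame(sorted(edges), columns=['src', 'dst'])

df_large = generate_edges()
print(len(df_large))   # 1000

G and G_gpu would then be rebuilt from df_large, the same way as above, before re-running the timings below.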


Well, the cugraph run (0.00643 sec) is now at least 4x faster than the networkx run (0.0282 sec).

Call the PageRank algorithm from networkx for larger dataset (1000 samples)

In [27]:

import time

# same PageRank parameters as before
max_iter = 100   # the maximum number of iterations
tol = 0.00001    # convergence tolerance
alpha = 0.85     # damping factor

start = time.time()
pr_nx = nx.pagerank(G, alpha=alpha, max_iter=max_iter, tol=tol)
print("----- process using networkx pagerank algorithm took  %s -----  " % (time.time() - start))

Out[27]: networkx performance is roughly 0.0282 seconds

----- process using networkx pagerank algorithm took  0.028274059295654297 -----  


Call the PageRank algorithm from cugraph for larger dataset (1000 samples)

In [30]:

# call cugraph.pagerank to get the PageRank scores on the GPU graph
start = time.time()
gdf_page = cugraph.pagerank(G_gpu, alpha=alpha, max_iter=max_iter, tol=tol)
print("----- process using cugraph pagerank algorithm took  %s -----  " % (time.time() - start))

Out[30]: cugraph performance is roughly 0.00643 seconds

----- process using cugraph pagerank algorithm took  0.006431102752685547 -----  
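If you want to convince yourself that the two libraries agree (my addition, not part of the original post), you can move the GPU result back to pandas and compare the scores. This assumes both graphs were built from the same integer node IDs, as in the generator sketch above.

# compare networkx (dict of node -> score) against cugraph (cuDF DataFrame)
pdf_page = gdf_page.to_pandas().set_index('vertex')['pagerank']
max_diff = max(abs(pdf_page[node] - score) for node, score in pr_nx.items())
print("max absolute difference between the two implementations:", max_diff)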

Well, this makes sense: parallel computing (i.e. utilizing GPUs) pays off when one indeed has a BIG data problem that requires high-performance computing power!

In the case of the tiny dataset, we are killing a mouse with a bazooka, which is not very efficient! The overhead of moving data to the GPU and launching kernels dominates such a tiny computation.

However, when one DOES have a LARGE ENOUGH dataset (such as our fake data with 1000 records), you can immediately see the promised speed-up!

Here is a short video clip of this experiment (please mute while watching it, since there was some inevitable background noise while the recording took place :P).

The git repo and data are here for your convenience!

I hope you enjoy this post, and happy experimenting ^___^b

