rapidsai experiment round 2 - PageRank algorithm with cuGraph
This is round 2 of my experiments with rapidsai.
If you are not familiar with how to set up the environment, please read the round 1 post linked below.
For those of you who already have an environment to experiment with rapidsai, today's short post might surprise you ...
Today we turn our focus to an old algorithm, PageRank. If you are not familiar with it, below are two URLs for a quick recap :)
Alright, let's run the PageRank algorithm with networkx (CPU) vs cugraph (GPU) on a tiny dataset (8 records) vs a larger one (1000 records).
############# PageRank with Tiny dataset (only 8 records) #############
Now, if we use the tiny dataset (only 8 records), the cugraph performance (0.0041 sec) is embarrassingly slower than the networkx performance (0.0011 sec).
In [1]:
import networkx as nx
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# load the data
df = pd.read_csv('./data/pageRankData.csv', index_col=False, sep=';')

# re-index so that the 'steps' column becomes the index itself
df2 = df.set_index(['steps'])
n = len(df2)
print("number of rows in csv data is", n)

# map each index label to an integer id
d = {idx: i for i, idx in enumerate(df2.index)}
Out[1]:
number of rows in csv data is 8
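Note: the cells below call PageRank on graph objects named G (networkx) and G_gpu (cugraph) that are not built in the snippet above. Here is a minimal sketch of how they might be constructed, reusing the df loaded above; the edge-list column names src and dst are assumptions, so adjust them to match the actual schema of pageRankData.csv:

import cudf
import cugraph
import networkx as nx

# hypothetical column names -- adjust to the actual csv schema
SRC, DST = 'src', 'dst'

# CPU graph: build a directed networkx graph from the pandas edge list
G = nx.from_pandas_edgelist(df, source=SRC, target=DST, create_using=nx.DiGraph())

# GPU graph: load the same edge list into cuDF and hand it to cugraph
# (older cugraph versions use cugraph.DiGraph() instead of directed=True)
gdf = cudf.read_csv('./data/pageRankData.csv', sep=';')
G_gpu = cugraph.Graph(directed=True)
G_gpu.from_cudf_edgelist(gdf, source=SRC, destination=DST)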
Call the PageRank algorithm from networkx
In [25]:
import time

# define the PageRank parameters
max_iter = 100    # maximum number of iterations
tol = 0.00001     # convergence tolerance
alpha = 0.85      # damping factor

start = time.time()
pr_nx = nx.pagerank(G, alpha=alpha, max_iter=max_iter, tol=tol)
print("----- process using networkx pagerank algorithm took %s ----- " % (time.time() - start))
Out[25]: note that the networkx run takes ~0.0011 sec
----- process using networkx pagerank algorithm took 0.0011019706726074219 -----
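nx.pagerank returns a plain dict mapping each node to its score, so you can peek at the top-ranked nodes like this (a quick sketch using the pr_nx result above):

# top 5 nodes by PageRank score (pr_nx is a {node: score} dict)
top5 = sorted(pr_nx.items(), key=lambda kv: kv[1], reverse=True)[:5]
for node, score in top5:
    print(node, round(score, 4))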
Call the PageRank algorithm from cugraph
In [27]:
# call cugraph.pagerank to get the PageRank scores
start = time.time()
gdf_page = cugraph.pagerank(G_gpu, alpha=alpha, max_iter=max_iter, tol=tol)
print("----- process using cugraph pagerank algorithm took %s ----- " % (time.time() - start))
Out[27]: note that the cugraph run takes ~0.0041 sec
----- process using cugraph pagerank algorithm took 0.0041408538818359375 -----
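Unlike networkx, cugraph.pagerank returns a cuDF DataFrame with 'vertex' and 'pagerank' columns, so the equivalent peek on the GPU side looks like this (a sketch against the gdf_page result above):

# top 5 vertices by PageRank score (gdf_page is a cudf.DataFrame)
print(gdf_page.sort_values('pagerank', ascending=False).head(5))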
########### PageRank algorithm with Larger dataset (1000 records) ###########
First, we need to write a function that generates 1000 non-repeating tuples of nodes connected to one another, as sketched below.
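The original generator is not reproduced here; a minimal sketch that yields 1000 unique (src, dst) pairs could look like the following (the node count, seed, and column names are assumptions):

import random
import pandas as pd

def make_edges(n_edges=1000, n_nodes=200, seed=42):
    # generate n_edges unique (src, dst) pairs with src != dst
    rng = random.Random(seed)
    edges = set()
    while len(edges) < n_edges:
        src, dst = rng.randrange(n_nodes), rng.randrange(n_nodes)
        if src != dst:
            edges.add((src, dst))
    return pd.DataFrame(sorted(edges), columns=['src', 'dst'])

df_big = make_edges()
print(len(df_big))   # 1000

G and G_gpu are then rebuilt from df_big exactly as in the tiny-dataset section.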
With the larger dataset, the cugraph performance (0.00643 sec) realizes a speed-up of at least 4X over the networkx performance (0.0282 sec).
Call the PageRank algorithm from networkx for larger dataset (1000 samples)
In [27]:
import time

# same parameters as before
max_iter = 100    # maximum number of iterations
tol = 0.00001     # convergence tolerance
alpha = 0.85      # damping factor

start = time.time()
pr_nx = nx.pagerank(G, alpha=alpha, max_iter=max_iter, tol=tol)
print("----- process using networkx pagerank algorithm took %s ----- " % (time.time() - start))
Out[27]: the networkx run now takes roughly 0.0282 seconds
----- process using networkx pagerank algorithm took 0.028274059295654297 -----
Call the PageRank algorithm from cugraph for larger dataset (1000 samples)
In [30]:
# call cugraph.pagerank to get the PageRank scores
start = time.time()
gdf_page = cugraph.pagerank(G_gpu, alpha=alpha, max_iter=max_iter, tol=tol)
print("----- process using cugraph pagerank algorithm took %s ----- " % (time.time() - start))
Out[30]: the cugraph run now takes roughly 0.00643 seconds
----- process using cugraph pagerank algorithm took 0.006431102752685547 -----
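As a quick sanity check on the speed-up claim, dividing the two wall-clock readings above gives roughly 4.4X:

cpu_t = 0.028274059295654297   # networkx timing from Out[27]
gpu_t = 0.006431102752685547   # cugraph timing from Out[30]
print("speed-up: %.1fX" % (cpu_t / gpu_t))   # ~4.4X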
Well, this makes sense, since parallel computing (i.e., utilizing GPUs) is meant for when one indeed has a BIG data problem and therefore requires high-performance computing power!!!
In the case of the tiny dataset, the fixed overhead of launching GPU kernels and moving data to device memory dwarfs the actual computation, so we are killing a mouse with a bazooka, which is not very efficient!
However, when one DOES have a LARGE ENOUGH dataset (such as our 1000-record fake data), you can immediately see the promised performance speed-up!!!
Here is a short video clip of this experiment (please mute while watching it, since there was some inevitable background noise while the recording took place :P).
The git repo and data are here for your convenience!
I do hope you enjoy this post. Happy experimenting ^___^b