RAPIDS experiment round 2: PageRank algorithm with cuGraph
source: https://rapids.ai/start.html


This is round 2 of my experiments with RAPIDS.

For those of you who are not familiar with how to set up the environment, please read the round 1 post linked below.

For those of you who already have an environment to experiment with RAPIDS, this short post might surprise you...

Today, we turn our focus to an old algorithm, PageRank. If you are not familiar with it, below are two URLs for a quick recap :)

Alright, let's run through the PageRank algorithm executed by networkx (CPU) vs cugraph (GPU) on a tiny (8 records) vs a larger (1000 records) dataset.

############# PageRank with Tiny dataset (only 8 records) #############

Now, if we use the tiny dataset (only 8 records),

well, the cugraph performance (0.0041 sec) is embarrassingly slower than the networkx performance (0.0011 sec).

In [1]:

import networkx as nx
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


# get the data
df = pd.read_csv('./data/pageRankData.csv', index_col=False, sep=';')
# re-index: use the 'steps' column as the index
df2 = df.set_index(['steps'])
n = len(df2)
print("number of rows in csv data is ", n)
# map each index value (step) to an integer id
d = dict((idx, i) for i, idx in zip(range(n), list(df2.index)))

Out[1]:

number of rows in csv data is  8
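The graphs used below, G for networkx and G_gpu for cugraph, are built from this edge list in a cell that is not shown in the post. Here is a minimal sketch of how they might be constructed; the column names 'src' and 'dst' are assumptions, since the schema of pageRankData.csv is not shown here.

# (sketch, not from the original post) build the two graphs used below
# 'src' and 'dst' are assumed column names in pageRankData.csv
import cudf
import cugraph

# CPU graph for networkx, built straight from the pandas edge list
G = nx.from_pandas_edgelist(df, source='src', target='dst')

# GPU graph for cugraph: move the edge list into a cuDF DataFrame first
gdf = cudf.from_pandas(df[['src', 'dst']])
G_gpu = cugraph.Graph()
G_gpu.from_cudf_edgelist(gdf, source='src', destination='dst')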

Call the PageRank algorithm from networkx

In [25]:

import time

# define the PageRank parameters
max_iter = 100   # the maximum number of iterations
tol = 0.00001    # convergence tolerance
alpha = 0.85     # damping factor

start = time.time()
pr_nx = nx.pagerank(G, alpha=alpha, max_iter=max_iter, tol=tol)
print("----- process using networkx pagerank algorithm took  %s -----  " % (time.time() - start))

Out[25]: the networkx run took ~0.0011 seconds

----- process using networkx pagerank algorithm took  0.0011019706726074219 -----  


Call the PageRank algorithm from cugraph

In [27]:

import cugraph

# call cugraph.pagerank to get the PageRank scores on the GPU graph
start = time.time()
gdf_page = cugraph.pagerank(G_gpu, alpha=alpha, max_iter=max_iter, tol=tol)
print("----- process using cugraph pagerank algorithm took  %s -----  " % (time.time() - start))

Out[27]: the cugraph run took ~0.0041 seconds

----- process using cugraph pagerank algorithm took  0.0041408538818359375 -----  
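As a quick sanity check (my addition, not in the original post): cugraph.pagerank returns a cuDF DataFrame with one row per vertex, so you can peek at the highest-ranked vertices directly on the GPU. The 'pagerank' column name below matches recent cuGraph releases and is an assumption for the version used here.

# inspect the GPU result: one row per vertex with its PageRank score
print(gdf_page.sort_values('pagerank', ascending=False).head())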

########### PageRank algorithm with Larger dataset (1000 records) ###########

First, we need to write a function that generates 1000 samples of non-repeating tuples of nodes connected to one another; a minimal sketch of such a generator is shown below.
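This sketch is my own reconstruction (the original helper is only shown as an image in the post); the node count of 200 and the column names 'src'/'dst' are arbitrary assumptions.

import random
import pandas as pd

def generate_edges(n_edges=1000, n_nodes=200, seed=42):
    """Return n_edges unique (src, dst) pairs with no self-loops."""
    random.seed(seed)
    edges = set()
    while len(edges) < n_edges:
        src, dst = random.randrange(n_nodes), random.randrange(n_nodes)
        if src != dst:
            edges.add((src, dst))
    return pd.DataFrame(sorted(edges), columns=['src', 'dst'])

df_large = generate_edges()
print(len(df_large))   # 1000

G and G_gpu would then be rebuilt from df_large, the same way as above, before re-running the timings below.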


Well, the cugraph run (0.00643 sec) is now at least 4x faster than the networkx run (0.0282 sec).

Call the PageRank algorithm from networkx for larger dataset (1000 samples)

In [27]:

import time

# same PageRank parameters as before
max_iter = 100   # the maximum number of iterations
tol = 0.00001    # convergence tolerance
alpha = 0.85     # damping factor

start = time.time()
pr_nx = nx.pagerank(G, alpha=alpha, max_iter=max_iter, tol=tol)
print("----- process using networkx pagerank algorithm took  %s -----  " % (time.time() - start))

Out[27]: networkx performance is roughly 0.0282 seconds

----- process using networkx pagerank algorithm took  0.028274059295654297 -----  


Call the PageRank algorithm from cugraph for larger dataset (1000 samples)

In [30]:

# call cugraph.pagerank to get the PageRank scores on the GPU graph
start = time.time()
gdf_page = cugraph.pagerank(G_gpu, alpha=alpha, max_iter=max_iter, tol=tol)
print("----- process using cugraph pagerank algorithm took  %s -----  " % (time.time() - start))

Out[30]: cugraph performance is roughly 0.00643 seconds

----- process using cugraph pagerank algorithm took  0.006431102752685547 -----  
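If you want to convince yourself that the two libraries agree (my addition, not part of the original post), you can move the GPU result back to pandas and compare the scores. This assumes both graphs were built from the same integer node IDs, as in the generator sketch above.

# compare networkx (dict of node -> score) against cugraph (cuDF DataFrame)
pdf_page = gdf_page.to_pandas().set_index('vertex')['pagerank']
max_diff = max(abs(pdf_page[node] - score) for node, score in pr_nx.items())
print("max absolute difference between the two implementations:", max_diff)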

Well, this makes sense: parallel computing (i.e. utilizing GPUs) pays off when one indeed has a BIG data problem that requires high-performance computing power!

In the case of the tiny dataset, we are killing a mouse with a bazooka, which is not very efficient! The overhead of moving data to the GPU and launching kernels dominates such a tiny computation.

However, when one DOES have a LARGE ENOUGH dataset (such as our fake data with 1000 records), you can immediately see the promised speed-up!

Here is a short video clip of this experiment (please mute while watching it, since there was some inevitable background noise while the recording took place :P).

The git repo and data are here for your convenience!

I hope you enjoy this post, and happy experimenting ^___^b

