The Power Of Vectorization
Introduction:
In today's world, data has been increasing exponentially. IDC forecasts a tenfold increase in data by 2025, to 163 zettabytes. This presents many challenges, such as storage, security and computational speed. In today's post we focus on computational speed. For a machine learning practitioner, best practices for dealing with huge datasets are essential. One such practice is vectorization. What is vectorization? It is the art of getting rid of explicit for loops (one of the workhorses of the pre-big data era) by expressing a computation as operations on whole arrays.
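To make the idea concrete, here is a minimal sketch of the same small computation written both ways. The array x and its values are made up purely for illustration; the only assumption is that NumPy is installed.

```python
import numpy as np

# Hypothetical example data: five small numbers, chosen only for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Loop version: square each element one at a time.
squares_loop = np.empty_like(x)
for i in range(len(x)):
    squares_loop[i] = x[i] * x[i]

# Vectorized version: a single expression operates on the whole array.
squares_vec = x * x

print(squares_loop)
print(squares_vec)
print(np.array_equal(squares_loop, squares_vec))
```

Both versions produce the same array; the vectorized one simply has no explicit loop in the Python code.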
Why is Vectorization Important?
Vectorization significantly speeds up your computations, especially in the big data domain. To illustrate this, I wrote some code in Python 3. The code performs the same computation, a dot product of two vectors, using two different approaches: vectorization (no for loop) and a for loop. It measures the execution time of each method and prints it for comparison. It turns out that vectorization is much quicker than iterating with a for loop. This makes sense: NumPy carries out the vector multiplication in optimized, compiled code, whereas the for loop is executed one element at a time by the Python interpreter. See the code below, which you can run in Jupyter or any Python IDE.
Code:
## Code starts here
import time

import numpy as np

# Time the dot product of two vectors of 20 million random numbers,
# computed in vector form with NumPy.
n = 20000000
a = np.random.rand(n)
b = np.random.rand(n)

tic = time.time()
c = np.dot(a, b)
toc = time.time()
time_lapse_vector_method = round(1000 * (toc - tic), 3)
print(round(c, 6))
print("This is how fast vectorization is: " + str(time_lapse_vector_method) + " ms")

# Now time the same operation using a for loop instead.
c = 0
tic = time.time()
for i in range(n):
    c += a[i] * b[i]
toc = time.time()
time_lapse_for_loop_method = round(1000 * (toc - tic), 3)
print(round(c, 6))
print("This is how slow the for loop is: " + str(time_lapse_for_loop_method) + " ms")
#### End of Code
Sample Results:
(Vectorization method, for loop method), one pair per run:
Run 1: (26.07 ms, 15936.326 ms)
Run 2: (35.087 ms, 15805.623 ms)
Run 3: (25.066 ms, 15502.911 ms)
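The speedup for each run is just the for loop time divided by the vectorized time. A short sketch of that arithmetic, using the three timing pairs from the sample results above (the names runs and speedups are my own, chosen for illustration):

```python
# Timings in milliseconds, copied from the sample results:
# (vectorized, for loop)
runs = [(26.07, 15936.326), (35.087, 15805.623), (25.066, 15502.911)]

# Speedup factor = for loop time / vectorized time, one value per run.
speedups = [loop_ms / vec_ms for vec_ms, loop_ms in runs]

for (vec_ms, loop_ms), factor in zip(runs, speedups):
    print(f"{loop_ms} ms / {vec_ms} ms = about {factor:.0f}x faster")
```

Your own numbers will differ from machine to machine, so treat these ratios as a rough indication rather than a fixed property of NumPy.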
Conclusion:
As can be seen from the sample results above, vectorization is much faster than using for loops, by a factor of several hundred (roughly 450 to 620 times faster in these runs). Since we are now in the era of big data, vectorization is one technique that can help speed up computations. It is a good trick to keep in your bag for different projects and scenarios. Try out the code multiple times and see!
This is cool, indeed!
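If you would rather not re-run the script by hand each time, here is one possible sketch that repeats the measurement with the standard library's timeit module and keeps the best time for each method. This is not the code used for the results above: the helper names dot_loop and compare, the smaller default size of 100,000 elements, and the fixed seed are all my own assumptions, picked so that the example finishes quickly.

```python
import timeit

import numpy as np


def dot_loop(a, b):
    """Dot product computed with an explicit Python for loop."""
    total = 0.0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total


def compare(n=100_000, repeats=3):
    """Return the best (for loop, vectorized) times in seconds over `repeats` runs."""
    rng = np.random.default_rng(0)  # fixed seed: an assumption, for repeatability
    a = rng.random(n)
    b = rng.random(n)
    loop_s = min(timeit.repeat(lambda: dot_loop(a, b), number=1, repeat=repeats))
    vec_s = min(timeit.repeat(lambda: np.dot(a, b), number=1, repeat=repeats))
    return loop_s, vec_s


loop_s, vec_s = compare()
print(f"for loop:   {1000 * loop_s:.3f} ms")
print(f"vectorized: {1000 * vec_s:.3f} ms")
```

Taking the minimum over a few repeats helps reduce noise from whatever else your machine is doing. Increase n to move closer to the 20 million elements used earlier in the post.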