Vectorization vs Loops: Boosting Performance in Python

Vectorization vs Loops: how it affects performance.

People often say “Python is slow”. When I take a closer look, I usually find that the slowdown has less to do with Python than with how the code is written.

I’ve seen data analysis scripts that loop through rows like this:

- for each row
- do a calculation
- append results

Let’s quickly look at a practical example. We have a dataset with 1,000,000 rows and want to apply a simple rule: if sales > 1000, mark it as high, else low.

1. Loop Approach

    labels = []
    for value in df["sales"]:
        if value > 1000:
            labels.append("high")
        else:
            labels.append("low")
    df["category"] = labels

What does this do?

- Loops through every row in Python
- Scales poorly as data grows
- Is hard to optimize further

Looping works, but every comparison and every append runs in the Python interpreter, so the per-row overhead adds up quickly as the data grows. Let’s try another approach for the same example.

2. Vectorized Approach

    import numpy as np

    df["category"] = np.where(df["sales"] > 1000, "high", "low")

What does this do?

- Operates on the entire column at once
- Makes the code cleaner and easier to reason about
- Stays fast as rows increase, because the per-element work runs in NumPy’s compiled code rather than in a Python loop

This gives exactly the same result, and it runs faster.

Much of the time, good performance does not depend on how much code you write or how elegant it looks. A simple switch from row-by-row thinking to column-level thinking can go a long way as the data in your DataFrame and model grows.

#Python #Dataanalytics #Numpy #Optimization #Datascience
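If you want to check the difference on your own machine, here is a minimal sketch that runs both approaches on the same column and times them. The random sales values, the seed, and the helper names label_with_loop and label_vectorized are assumptions made up for illustration; they are not part of any real dataset or library. Actual timings will vary with hardware and library versions, so the script prints whatever it measures rather than promising a specific speedup.

```python
import timeit

import numpy as np
import pandas as pd


def label_with_loop(sales: pd.Series) -> list:
    """Row-by-row version: one Python-level comparison and append per value."""
    labels = []
    for value in sales:
        if value > 1000:
            labels.append("high")
        else:
            labels.append("low")
    return labels


def label_vectorized(sales: pd.Series) -> np.ndarray:
    """Column-level version: a single comparison over the whole column."""
    return np.where(sales > 1000, "high", "low")


# Hypothetical data: 1,000,000 random sales values between 0 and 2000.
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({"sales": rng.uniform(0, 2000, size=1_000_000)})

# Both approaches should produce identical labels.
assert list(label_vectorized(df["sales"])) == label_with_loop(df["sales"])

# Time each approach, keeping the best of three single runs.
loop_time = min(timeit.repeat(lambda: label_with_loop(df["sales"]), number=1, repeat=3))
vec_time = min(timeit.repeat(lambda: label_vectorized(df["sales"]), number=1, repeat=3))

print(f"loop:       {loop_time:.3f} s")
print(f"vectorized: {vec_time:.3f} s")
```

Taking the best of a few runs is a simple way to reduce noise from other processes; for more careful measurements you would want more repeats and a range of column sizes.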
