Multiple linear regression algorithm: Applying the ‘dummy variable’ concept to a zonal sales business case


Business case:

In this article, I take the business case of zone-wise sales data for XYZ company. The data records the company's sales and the amount invested in each of its emerging technology businesses across two zones, East and West.

Here, we are trying to understand whether there is a correlation between the sales numbers and the money invested in each business. For example, looking at row 1 of the data table, we are interested in a question like: how much should the company invest in VR, IoT or AI in order to maximize its sales?

If we break down the data set to apply multiple linear regression, sales becomes the dependent variable, because it is the quantity we want to predict from the independent variables. The independent variables in this data set are the amounts invested in the VR, IoT and AI businesses.

All of the above variables are numeric: they hold the amount of money invested. The Zone column, however, consists of strings, or simply put, words in the English language. A regression equation can only take numbers, so we need a way to encode the zone names as 0s and 1s.

Dummy variable concept:

We will apply the concept of dummy variables to do this. First, we count the number of categories in the Zone column. There are two categories, namely East and West.

The first step in creating dummy variables is to create as many new columns as there are categories. Here, we create two columns, one for each category: East and West.

Then, for every row of the data set where the zone is East, we place a 1 in the East column, and a 0 otherwise. The West column is filled in the same way: 1 where the zone is West, 0 otherwise.
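As a minimal sketch of this mapping in Python, assuming the zone values are available as a plain list (the five sample rows are made up for illustration and are not the article's data):

```python
# Hypothetical zone values; the article's real data set is not reproduced here.
zones = ["East", "West", "West", "East", "East"]

# One dummy column per category: 1 where the row belongs to that zone, else 0.
categories = sorted(set(zones))  # ['East', 'West']
dummies = {cat: [1 if z == cat else 0 for z in zones] for cat in categories}

for cat in categories:
    print(cat, dummies[cat])
```

If the data lives in a pandas DataFrame, `pandas.get_dummies` builds the same kind of columns in one call.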

Next, we apply the multiple linear regression equation to our original data set:

y = b0 + b1*x1 + b2*x2 + b3*x3 + b4*Dv1

Here, y is the dependent variable, sales.

b0 is a constant, also known as the y-intercept. On a 2D plot, the y-intercept is simply the point where the regression line cuts the Y axis; in this equation, it is the value of y when every other term is zero.

b1, b2, b3 and b4 are the coefficients. x1, x2 and x3 are the independent variables, with x1 = Virtual reality (VR), x2 = IoT and x3 = AI.

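Dv1 refers to dummy variable 1, the East zone column. We should not include the second dummy variable, ‘West zone’, in the regression equation. It is redundant, because West is always 1 minus East, and including both columns alongside the constant b0 would leave the coefficients without a unique solution (this is commonly called the dummy variable trap). In b4*Dv1, Dv1 refers to the East zone, so when Dv1 = 1 the entire equation holds for the East zone.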
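One way to see why the second dummy column adds nothing is to check that it can be rebuilt from the first. The sketch below uses made-up dummy columns for five hypothetical rows. Because every row belongs to exactly one zone, East + West is 1 on every row, which is the same column of 1s that the constant b0 already multiplies.

```python
# Hypothetical dummy columns for five rows (made-up values, not the article's data).
east = [1, 0, 0, 1, 1]
west = [0, 1, 1, 0, 0]

# Every row belongs to exactly one zone, so the two columns always sum to 1.
row_sums = [e + w for e, w in zip(east, west)]
print(row_sums)

# West can be rebuilt from East alone, so it carries no extra information.
west_from_east = [1 - e for e in east]
print(west_from_east == west)
```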

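Conversely, when Dv1 = 0, meaning the zone is not East (in other words, it is West), the term b4*Dv1 becomes 0. In this case, the equation holds for the West zone. The coefficient b4 is therefore the difference in predicted sales between East and West when the investments are the same.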
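The switching behaviour of Dv1 can be sketched with a small Python function that evaluates the equation directly. The function name `predict_sales`, the coefficient values and the investment amounts are all hypothetical, chosen only to make the arithmetic easy to follow. They are not fitted from any real data.

```python
def predict_sales(vr, iot, ai, is_east, b0, b1, b2, b3, b4):
    """Evaluate y = b0 + b1*x1 + b2*x2 + b3*x3 + b4*Dv1."""
    dv1 = 1 if is_east else 0  # dummy variable: 1 for East, 0 for West
    return b0 + b1 * vr + b2 * iot + b3 * ai + b4 * dv1

# Hypothetical coefficients (assumed values, not estimated from data).
coef = dict(b0=100.0, b1=2.0, b2=3.0, b3=4.0, b4=50.0)

# Same investments in both zones; only the dummy variable changes.
east_sales = predict_sales(10, 20, 30, is_east=True, **coef)
west_sales = predict_sales(10, 20, 30, is_east=False, **coef)

print(east_sales, west_sales, east_sales - west_sales)
```

With identical investments, the two predictions differ by exactly b4, which is what the single dummy variable contributes.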

To better understand the dummy variable concept, we can think of it as a yes-or-no switch for whether the East zone applies, or as a box that is either open or closed.


More articles by Sajeesh Nair
