Multivariate Regression using Python

I am sharing a self-written script that fits a multivariate regression. In keeping with the current trend, I have used state-wise Covid case data for India. My hypothesis is that the state-wise numbers depend on factors particular to each state, such as urban population and rural population.

Through this regression I want to:

  1. Test my hypothesis that state-wise confirmed cases can be explained by factors such as population, urban proportion, density etc.
  2. Check and establish a relationship between confirmed cases and these factors

---

Script:

import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

covid_count = pd.read_html(r"https://www.mygov.in/covid-19")
India_Covid = covid_count[0]
India_Covid = India_Covid.drop(axis=0,index=8) #Dadra nagar Haveli is repeated twice in the DataFrame so cleaning the data
India_Covid = India_Covid.set_index("State/UTs") #Setting new index to State which will be used in the join
#print(India_Covid)

States = pd.read_html(r"https://en.wikipedia.org/wiki/List_of_states_and_union_territories_of_India_by_population")
State_info = States[1]
#Cleaning the data
State_info.drop(columns="Rank",inplace=True)
State_info.columns = ['State/UTs', 'Population', 'Population_share','Decadal_growth', 'Rural_population', 'Percent_rural','Urban_population', 'Percent_urban', 'Area', 'Density','Sex_ratio']
#Data Cleaned
State_info.set_index("State/UTs",inplace=True) #Setting new index to State which will be used in the join
State_info.drop(index="India",axis=0,inplace=True) #Dropping the last row , can be included in the cleaning step
#print(State_info)

CovidInd = India_Covid.merge(State_info,how="inner",on="State/UTs",suffixes=["left","right"])
final = CovidInd.iloc[:,[0,1,2,3,7,9,13]]
#print(CovidInd)

#Regression
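X = final.iloc[:,[4,5,6]] #Independent variables: Rural_population, Urban_population, Sex_ratio
Y = final.iloc[:,0] #Dependent variable: first column of the Covid table (confirmed cases)

X = sm.add_constant(X) #Adds a constant column so the model fits a Y intercept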

model = sm.OLS(Y,X).fit() #Passing the independent and dependent values
predictions = model.predict(X)

print(model.summary()) #Print out the regression statistics

---

The Python script above helps me do that. I import the real-time state-wise Covid case numbers published by the Indian government on its website. From Wikipedia, I import population and related information. I join these two tables to get a final table that holds my independent variables (urban population, rural population, density, sex ratio etc.) and my dependent variable (confirmed cases) for each state, and via regression I establish a linear relationship among them. Because the join is an inner join, states that do not appear in both tables (for example, states with no confirmed cases) are left out of the regression.
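The join step can be sketched on toy data as follows. This is a minimal sketch, not the script above: the two small tables and all their numbers are made up for illustration, and the columns are picked by name rather than by position, which is one possible way to make the selection less fragile.

```python
import pandas as pd

# Toy stand-ins for the two scraped tables (all numbers are made up).
cases = pd.DataFrame(
    {"State/UTs": ["Kerala", "Goa", "Punjab"], "Confirmed": [500, 70, 1900]}
).set_index("State/UTs")

population = pd.DataFrame(
    {
        "State/UTs": ["Kerala", "Punjab", "Sikkim"],
        "Rural_population": [17_000_000, 17_000_000, 450_000],
        "Urban_population": [16_000_000, 10_000_000, 150_000],
        "Sex_ratio": [1084, 895, 890],
    }
).set_index("State/UTs")

# Inner join on the shared index: only states present in BOTH tables survive,
# so Goa (no population row) and Sikkim (no case row) are dropped.
joined = cases.join(population, how="inner")

# Selecting by column name keeps working if a source table gains or
# reorders columns, which positional iloc indices would not.
X = joined[["Rural_population", "Urban_population", "Sex_ratio"]]
Y = joined["Confirmed"]

print(joined)
```

Printing `joined` before fitting is a cheap way to check which states actually made it into the regression.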

Results:

This is the output of the regression. It may differ when you run it, since the government Covid numbers are updated in real time.

[Image: regression summary output]

Interpretation:

The regression equation takes the form Y = k + a*x1 + b*x2 + c*x3 + ..., with Y being the dependent variable, k the constant, and x1, x2, x3 ... xn the independent variables.

For our regression, the equation is Confirmed Cases = 9.34e+04 - 0.008*Rural_population + 0.0038*Urban_population - 101.3083*Sex_ratio
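To see what the equation does with actual numbers, here is a small sketch that plugs values into it. The coefficients are the ones quoted above; the function name and the input values are hypothetical and chosen only for illustration.

```python
def predicted_cases(rural_population, urban_population, sex_ratio):
    """Evaluate the fitted equation quoted above for one state."""
    intercept = 9.34e4
    return (
        intercept
        - 0.008 * rural_population
        + 0.0038 * urban_population
        - 101.3083 * sex_ratio
    )


# Hypothetical state: 10 million rural, 30 million urban, sex ratio of 900.
estimate = predicted_cases(10_000_000, 30_000_000, 900)
print(round(estimate, 2))

# Same state with only 5 million urban residents.
smaller_estimate = predicted_cases(10_000_000, 5_000_000, 900)
print(round(smaller_estimate, 2))
```

Nothing constrains a linear equation to stay positive: with the urban population lowered to 5 million, the same call returns a negative value, which a count of cases can never be.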

Adjusted R-Square - Reflects how well the model fits the data, adjusted for the number of independent variables used. The value is at most 1 and normally falls between 0 and 1, and a high value is desired. Here the Adj R-sq. value is 0.734, indicating a moderately good fit.
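For reference, the standard adjustment can be written as a short function. This is a sketch of the textbook formula, 1 - (1 - R^2)(n - 1)/(n - p - 1), with n observations and p predictors; the input values below are hypothetical and are not taken from the regression output.

```python
def adjusted_r_squared(r_squared, n_observations, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    degrees_of_freedom = n_observations - n_predictors - 1
    if degrees_of_freedom <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r_squared) * (n_observations - 1) / degrees_of_freedom


# Hypothetical inputs: R^2 of 0.76 from 30 observations and 3 predictors.
good_fit = adjusted_r_squared(0.76, 30, 3)
print(round(good_fit, 3))

# A weak fit on few observations can push the adjusted value below zero.
weak_fit = adjusted_r_squared(0.05, 10, 3)
print(round(weak_fit, 3))
```

The penalty grows with the number of predictors, so adding a variable only raises the adjusted value if it improves the fit enough to pay for itself.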

Constant - Y-intercept on the X-Y graph. In other words, the value of the dependent variable when all independent variables are 0. Here the constant is 9.34e+04.

Coefficients - These are the coefficients of the independent variables in the regression equation. So from the above equation these are the 'a', 'b', 'c' etc. values.

Standard Error - Reflects how precisely each coefficient is estimated. A lower value relative to the size of the coefficient is desired.

P-value - Indicates whether an independent variable has a statistically significant relationship with the dependent variable. By convention a value below 0.05 is treated as significant. In this case the urban and rural population variables are significant, while 'Sex_ratio' is not a significant variable for this model.

Sources: Python IDE - PyCharm; Covid numbers - Government website (mygov.in); Population data - Wikipedia

Disclaimer: This example should not be used as a predictive model for future cases. It was written to gain some hands-on practice with Python and to illustrate a regression example using real-world data.

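Troubleshooting the script: Please ensure all the required packages (pandas, numpy, statsmodels and matplotlib) are installed in your Python environment so that the script runs seamlessly on copy-pasting. pd.read_html also needs an HTML parser such as lxml. If either web page changes its table layout, the hard-coded table and column positions in the script will need adjusting.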

Great article, but unfortunately this is not multivariate regression. This is multiple (linear) regression. Multivariate regression has several dependent variables, while multiple regression has a single dependent variable and more than one predictor (or independent variable). In your code there is one dependent variable and more than one predictor.
