Multivariate Regression using Python
Sharing a self-written script to generate a multivariate regression. In keeping with the current trend, I have used state-wise Covid case data for India. My hypothesis is that the state-wise numbers depend on factors particular to each state, such as its urban population and rural population.
Through this regression I want to -
- Test my hypothesis that state-wise confirmed cases can be explained by factors such as population, urban proportion, density etc.
- Check and establish a relationship between confirmed cases and these factors
---
Script:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
covid_count = pd.read_html(r"https://www.mygov.in/covid-19")
India_Covid = covid_count[0]
India_Covid = India_Covid.drop(axis=0,index=8) #Dadra Nagar Haveli appears twice in the DataFrame, so drop the duplicate row (index 8 at the time of writing)
India_Covid = India_Covid.set_index("State/UTs") #Setting new index to State which will be used in the join
#print(India_Covid)
States = pd.read_html(r"https://en.wikipedia.org/wiki/List_of_states_and_union_territories_of_India_by_population")
State_info = States[1]
#Cleaning the data
State_info.drop(columns="Rank",inplace=True)
State_info.columns = ['State/UTs', 'Population', 'Population_share','Decadal_growth', 'Rural_population', 'Percent_rural','Urban_population', 'Percent_urban', 'Area', 'Density','Sex_ratio']
#Data Cleaned
State_info.set_index("State/UTs",inplace=True) #Setting new index to State which will be used in the join
State_info.drop(index="India",axis=0,inplace=True) #Dropping the last row , can be included in the cleaning step
#print(State_info)
CovidInd = India_Covid.merge(State_info,how="inner",on="State/UTs",suffixes=["left","right"])
final = CovidInd.iloc[:,[0,1,2,3,7,9,13]]
#print(CovidInd)
#Regression
X = final.iloc[:,[4,5,6]] #Independent variables
Y = final.iloc[:,0] #Dependent variable (confirmed cases)
X = sm.add_constant(X) #Adds the Y intercept to our model
model = sm.OLS(Y,X).fit() #Passing the independent and dependent values
predictions = model.predict(X)
print(model.summary()) #Print out the regression statistics
---
The Python script above helps me do that. I import the real-time state-wise Covid case numbers published by the Indian government on its website, and some population and related information from Wikipedia. Joining these two tables gives a final table that holds my independent variables (urban population, rural population, density, sex ratio etc.) and my dependent variable (confirmed cases) for each state, and the regression then fits a linear relationship between them. States with no confirmed cases will be left out of the regression.
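To see how the inner join decides which states reach the regression, here is a minimal sketch with made-up numbers. The state names and figures are hypothetical placeholders, not the real data. Only states present in both tables survive the merge, so a state missing from either source silently drops out.

```python
import pandas as pd

# Hypothetical stand-in for the Covid table (figures are invented)
covid = pd.DataFrame(
    {"State/UTs": ["State A", "State B", "State C"], "Confirmed": [120, 45, 300]}
).set_index("State/UTs")

# Hypothetical stand-in for the population table (figures are invented)
population = pd.DataFrame(
    {"State/UTs": ["State A", "State C", "State D"], "Urban_population": [5000, 9000, 700]}
).set_index("State/UTs")

# Inner join on the shared index name, mirroring the merge in the script
joined = covid.merge(population, how="inner", on="State/UTs")

# States that appear in only one of the two tables
dropped = sorted(set(covid.index) ^ set(population.index))

print(joined)
print("Left out of the regression:", dropped)
```

Printing the dropped names before fitting is a cheap way to spot spelling mismatches between the two sources.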
Results:
This is the output of the regression. It may differ when you run it, since the government's Covid numbers are updated in real time.
Interpretation:
The regression equation takes the form Y = k + a*x1 + b*x2 + c*x3 + ..., with Y being the dependent variable, k the constant, and x1, x2, x3 ... xn the independent variables.
For our regression, the equation is Confirmed Cases = 9.34e+04 - 0.008*Rural_population + 0.0038*Urban_population - 101.3083*Sex_ratio
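As a rough illustration of how to read this equation, the sketch below evaluates it as a plain Python function. The coefficients are copied from the equation above. The function name and the inputs are my own choices for the example, not part of the original script.

```python
# Coefficients copied from the fitted equation quoted above
INTERCEPT = 9.34e4
COEF_RURAL = -0.008
COEF_URBAN = 0.0038
COEF_SEX_RATIO = -101.3083


def predict_cases(rural_population, urban_population, sex_ratio):
    """Evaluate the fitted linear equation for one state (hypothetical helper)."""
    return (INTERCEPT
            + COEF_RURAL * rural_population
            + COEF_URBAN * urban_population
            + COEF_SEX_RATIO * sex_ratio)


# With every input at 0 the prediction is just the constant
baseline = predict_cases(0, 0, 0)

# Effect of 1,000 extra urban residents with the other inputs unchanged
urban_effect = predict_cases(0, 1000, 0) - baseline

print(baseline)
print(round(urban_effect, 4))
```

The second number is the urban coefficient multiplied by 1,000, which is what "coefficient" means in practice: the change in predicted cases per unit change in that variable, with the others held fixed.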
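Adjusted R-Square - Reflects how well the model fits the data, with a penalty for each extra independent variable. The value is at most 1 and usually lies between 0 and 1, although it can fall below 0 for a very poor fit. A high Adjusted R-square value is desired. Here the Adj. R-sq. value is 0.734, indicating a moderately good fit.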
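For reference, the standard adjustment can be sketched in a few lines. The helper name and the inputs below are hypothetical and chosen only to show the arithmetic. In statsmodels the fitted result already exposes this value as `model.rsquared_adj`.

```python
def adjusted_r_squared(r_squared, n_observations, n_predictors):
    """Adjusted R-square: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n_observations - n_predictors - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    penalty = (n_observations - 1) / (n_observations - n_predictors - 1)
    return 1 - (1 - r_squared) * penalty


# Hypothetical inputs: 30 observations and 3 predictors
print(adjusted_r_squared(0.80, 30, 3))

# A weak fit on few observations can push the value below 0
print(adjusted_r_squared(0.05, 10, 3))
```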
Constant - The Y-intercept on the X-Y graph. In other words, the value of the dependent variable when all the independent variables are 0. Here the constant is 9.34e+04.
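Coefficients - These are the coefficients of the independent variables in the regression equation, i.e. the 'a', 'b', 'c' etc. values in the general form above. Each one is the change in the dependent variable for a one-unit change in that independent variable, with the others held constant.
Standard Error - Reflects the precision of each coefficient estimate. A lower value is desired.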
P-value - Indicates whether an independent variable is statistically significant in explaining changes in the dependent variable. By the usual convention it should be less than 0.05. In this case the urban and rural population variables are significant, while 'sex_ratio' is not a significant variable for this model.
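The 0.05 check can also be done in code rather than by reading the summary table. Below is a minimal sketch that assumes the p-values are available as a name-to-value mapping. The helper name and the p-values are invented for illustration and are not the output of the regression above. With statsmodels, `model.pvalues.to_dict()` should give such a mapping (including the constant).

```python
def split_by_significance(p_values, alpha=0.05):
    """Split predictor names into significant / not significant at level alpha."""
    significant = [name for name, p in p_values.items() if p < alpha]
    not_significant = [name for name, p in p_values.items() if p >= alpha]
    return significant, not_significant


# Hypothetical p-values, invented for illustration only
example_p_values = {
    "Rural_population": 0.001,
    "Urban_population": 0.0004,
    "Sex_ratio": 0.41,
}

significant, not_significant = split_by_significance(example_p_values)
print("Significant:", significant)
print("Not significant:", not_significant)
```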
Sources: Python IDE - PyCharm; Covid numbers - government website; Population data - Wikipedia
Disclaimer: This example should not be used as a predictive model for any future cases. It was used to gain some hands-on experience with Python and to illustrate a regression example using real-world data.
Troubleshooting the script: Please ensure you have all the required packages installed in your Python environment (pandas, numpy, statsmodels and matplotlib; pd.read_html also needs an HTML parser such as lxml) so that the script runs seamlessly on copy-pasting.
Great article, but unfortunately this is not multivariate regression. This is multiple (linear) regression. Multivariate regression has several dependent variables (with one or more predictors), while multiple regression has a single dependent variable and several predictors (independent variables). In your code there is one dependent variable and more than 1 predictor.
Great feat Saket Dhodapkar , FRM®.