LINEAR REGRESSION: THE IDEA
Linear Regression (LR) is one of the simplest yet most powerful tools we use in both deductive and predictive analyses. For years now, I have used it for parameter estimation, hypothesis testing and, of course, generating predictions in the context of my academic research. The idea is to determine the slope (b1) and intercept (b0) of the straight line y = b0 + b1*x that best describes the relationship between variables x and y in a data set D = {(x, y)} of, say, N records. Unlike many other tools of predictive analytics, LR is special in that the mathematics affords closed-form formulas for the parameters b0 and b1, which makes it very easy to understand and to code, especially in Python. Minimizing the sum of squared residuals sum_i (y_i - b0 - b1*x_i)^2 gives:
b1 = Covariance(x, y) / Variance(x)
b0 = mean(y) - b1 * mean(x)
Coding this is a piece of cake:
def LinearRegression1D_fit(x, y):
    # "stat" is a small helper module of mine (mean, variance,
    # covariance, ...); a sketch of it is given further below.
    try:
        b0 = b1 = None
        bNum = stat.covariance(x, y)   # slope numerator
        bDen = stat.variance(x)        # slope denominator
        xmean = stat.mean(x)
        ymean = stat.mean(y)
        if bDen > 0:
            b1 = bNum/bDen
            # The fitted line passes through (mean(x), mean(y))
            b0 = ymean - b1*xmean
        return b0, b1
    except Exception as err:
        print(err)
One can easily generate the model line now:
yLine = [b0+b1*xi for xi in x]
Each Y_i in the list yLine is the modeled value of y at the regressor value x_i, and with a little research one can find the analytical formula for its variance:

Variance(Y_i) = MSE * (1/N + (x_i - mean(x))^2 / sum_j (x_j - mean(x))^2)

where MSE is the mean squared residual. The variance of Y_i is an important idea since it allows us to construct confidence intervals about the model line.
def LinearRegression1D_LineVariance(x, mse):
    '''
    x   : list of regressors
    mse : mean squared residual
    '''
    try:
        n = len(x)
        xmean = stat.mean(x)
        # Sum of squared deviations of x about its mean
        xdevsq = stat.sq_sum([xi - xmean for xi in x])
        if xdevsq > 0 and n > 0:
            ret = []
            for xh in x:
                # Variance of the fitted value at xh
                term = 1./n + math.pow(xh - xmean, 2)/xdevsq
                ret.append(mse*term)
            return ret
        return None
    except Exception as err:
        print(err)
Both of the above snippets rely on a few commonly used helpers that I have collected into a generic module called "stat".
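For readers who want to run the snippets immediately, here is a minimal sketch of what those helpers might look like; mean, variance and covariance mirror Python's built-in statistics module (sample versions, with N-1 denominators), and sq_sum is taken to be the sum of squared values:

def mean(v):
    #Arithmetic mean
    return sum(v)/len(v)

def sq_sum(v):
    #Sum of squares of the elements
    return sum(vi*vi for vi in v)

def variance(v):
    #Sample variance (denominator N-1)
    m = mean(v)
    return sq_sum([vi - m for vi in v])/(len(v) - 1)

def covariance(x, y):
    #Sample covariance (denominator N-1)
    xm, ym = mean(x), mean(y)
    return sum((xi - xm)*(yi - ym) for xi, yi in zip(x, y))/(len(x) - 1)

Since b1 divides a covariance by a variance, the N-1 denominators cancel, so population versions (denominator N) would give the same slope.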
The picture above shows the data points, the model line (broken red) and the 95% confidence interval on y for a toy data set. Below is the test script for the small module, which I can happily share on request.
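The script calls two helpers, LinearRegression1D_predict and LinearRegression1D_residuals, that were not listed above; their behavior is clear from how they are used, so here is a minimal sketch of them (the bodies are a reconstruction, not the original module):

def LinearRegression1D_predict(x, b0, b1):
    #Sketch: evaluate the fitted line at each regressor value
    return [b0 + b1*xi for xi in x]

def LinearRegression1D_residuals(x, y, b0, b1):
    #Sketch: observed minus fitted values
    return [yi - (b0 + b1*xi) for xi, yi in zip(x, y)]

And the test script itself: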
import math
import matplotlib.pyplot as plt
import linreg   #the small module containing the functions above

#Toy data set
x = [1,2,3,4,5,6,7]
y = [2,7,8,13,14,20,19]
#Set and Fit Model Parameters
b0,b1 = linreg.LinearRegression1D_fit(x,y)
#Obtain Model Line, and Residuals
y_fit = linreg.LinearRegression1D_predict(x,b0,b1)
y_res = linreg.LinearRegression1D_residuals(x,y,b0,b1)
#Mean square of residuals (N-2 degrees of freedom: two fitted parameters)
mse = sum([yi*yi for yi in y_res])/(len(y_res)-2)
#Variance (Y) on points of fit line
y_fit_var = linreg.LinearRegression1D_LineVariance(x,mse)
#Constructing the upper and lower CI (95%; 2 approximates the normal quantile 1.96)
y_ci_low = [yi-2*math.sqrt(yvar) for yi,yvar in zip(y_fit,y_fit_var)]
y_ci_hi = [yi+2*math.sqrt(yvar) for yi,yvar in zip(y_fit,y_fit_var)]
#Plot the results
plt.scatter(x,y)
plt.plot(x,y_fit,'r--')
plt.plot(x,y_ci_low,'r')
plt.plot(x,y_ci_hi,'r')
plt.ylabel('Y-variable')
plt.xlabel('X-variable')
plt.show()
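One refinement worth noting: the multiplier 2 above approximates the 97.5th percentile of the normal distribution (about 1.96). With only seven points the residual variance is itself quite uncertain, and the textbook interval uses a Student-t quantile with N-2 degrees of freedom instead. A sketch, assuming scipy is available:

from scipy.stats import t

#97.5th percentile of t with N-2 degrees of freedom;
#for N = 7 this is about 2.57, noticeably wider than 2
tval = t.ppf(0.975, len(x) - 2)
y_ci_low = [yi - tval*math.sqrt(yvar) for yi, yvar in zip(y_fit, y_fit_var)]
y_ci_hi = [yi + tval*math.sqrt(yvar) for yi, yvar in zip(y_fit, y_fit_var)]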
This seems like a nice point to close this basic introduction to linear regression modeling, but there is definitely a lot more to it. What else can be done? Can we use LR to model a quadratic or higher-order relationship between x and y (y = a*x**2 + b*x + c)? What assumptions are imposed when we use this algorithm? Hopefully, I will write a bit more in the future to examine all that :).