LINEAR REGRESSION : THE IDEA

Linear regression (LR) is one of the simplest yet most powerful tools used in both explanatory and predictive analyses. For years I have used it in my academic research for parameter estimation, hypothesis testing and, of course, generating predictions. The idea is to determine the slope (b1) and intercept (b0) of the straight line y = b0 + b1*x that best describes the relationship between variables x and y in a data set D = {(x, y)} of, say, N records. Unlike many other tools of predictive analytics, LR is special in that mathematics affords closed-form analytical formulas for the parameters b0 and b1. This makes it very easy to understand and to code, especially in Python.

b1 = Covariance(x, y) / Variance(x)

b0 = mean(y) - b1 * mean(x)

Coding this is a piece of cake:

def LinearRegression1D_fit(x, y):
    '''Return intercept b0 and slope b1 of the least-squares line.'''
    try:
        b0 = b1 = None
        bNum = stat.covariance(x, y)
        bDen = stat.variance(x)
        xmean = stat.mean(x)
        ymean = stat.mean(y)
        if bDen > 0:
            b1 = bNum / bDen
            b0 = ymean - b1 * xmean
        return b0, b1
    except Exception as err:
        print(err)

One can now easily generate the model line:

yLine = [b0+b1*xi for xi in x]
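The test script further below also calls LinearRegression1D_predict and LinearRegression1D_residuals, which are not listed in this article. A minimal sketch of what those helpers might look like (the names follow the script; the bodies are my assumption):

```python
def LinearRegression1D_predict(x, b0, b1):
    '''Evaluate the fitted line b0 + b1*x at each regressor value.'''
    return [b0 + b1 * xi for xi in x]

def LinearRegression1D_residuals(x, y, b0, b1):
    '''Observed minus modeled values, y_i - (b0 + b1*x_i).'''
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
```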

Each Y_i in the list yLine is the modeled value of y at the corresponding value of xi, and with a little research one can find the analytical formula for the variance of Y_i. Variance(Y_i) is an important quantity since it allows us to construct confidence intervals around the model line.

def LinearRegression1D_LineVariance(x, mse):
    '''
    x   : list of regressors
    mse : mean squared residuals
    '''
    try:
        n = len(x)
        xmean = stat.mean(x)
        xdevsq = stat.sq_sum([xi - xmean for xi in x])
        if xdevsq > 0 and n > 0:
            ret = []
            for xh in x:
                term = 1. / n + math.pow(xh - xmean, 2) / xdevsq
                term = mse * term
                ret.append(term)
            return ret
        return None
    except Exception as err:
        tools.print_error(err, '')
        
    
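As a sanity check on the formula Var(Y_i) = mse * (1/n + (x_h - xmean)^2 / sum((x_i - xmean)^2)), a standalone re-implementation (hypothetical, using only the standard library in place of the stat and tools helpers) shows the expected behavior: the variance is smallest at the sample mean of x and grows quadratically toward the ends of the range.

```python
import statistics

def line_variance(x, mse):
    # Var(Y_hat at x_h) = mse * (1/n + (x_h - xbar)^2 / Sxx)
    n = len(x)
    xbar = statistics.mean(x)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [mse * (1.0 / n + (xh - xbar) ** 2 / sxx) for xh in x]

x = [1, 2, 3, 4, 5, 6, 7]
v = line_variance(x, mse=1.0)
# smallest variance at x = 4 (the sample mean); symmetric about it
```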

Both of the snippets above rely on commonly used helper functions that I have collected into a generic module called "stat"; I will happily share the full package on request.

The figure above shows the data points, the model line (broken red) and the 95% confidence intervals on y for a toy data set. Below is the test script for the small module:

x = [1,2,3,4,5,6,7]
y = [2,7,8,13,14,20,19]

#Set and Fit Model Parameters
b0,b1 = linreg.LinearRegression1D_fit(x,y)

#Obtain Model Line, and Residuals
y_fit = linreg.LinearRegression1D_predict(x,b0,b1)
y_res = linreg.LinearRegression1D_residuals(x,y,b0,b1)

#Mean squared residuals (n - 2 degrees of freedom)
mse = sum([yi*yi for yi in y_res])/(len(y_res)-2)

#Variance (Y) on points of fit line
y_fit_var = linreg.LinearRegression1D_LineVariance(x,mse)

#Constructing the approximate 95% CI bands (about ±2 standard errors)
y_ci_low = [yi-2*math.sqrt(yvar) for yi,yvar in zip(y_fit,y_fit_var)]
y_ci_hi = [yi+2*math.sqrt(yvar) for yi,yvar in zip(y_fit,y_fit_var)]

#Plot the results
plt.scatter(x,y)
plt.plot(x,y_fit,'r--')
plt.plot(x,y_ci_low,'r')
plt.plot(x,y_ci_hi,'r')
plt.ylabel('Y-variable')
plt.xlabel('X-variable')

This seems like a nice point to close this basic introduction to linear regression modeling, but there is definitely a lot more to it. What else can be done? Can we use LR to model quadratic or higher-order relationships between x and y (y = a*x**2 + b*x + c)? What assumptions are imposed when we use this algorithm? Hopefully, I will write a bit more in the future to examine all of that :).

