Machine Learning Algorithm - Regularization Part 2 of 12


Ridge Regression:

Ridge regression is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates are unbiased, but their variances are large, so they may be far from the true values. By adding a degree of bias to the regression estimates, ridge regression reduces the standard errors, in the hope that the net effect is a set of estimates that are more reliable. Another biased regression technique, principal components regression, is also available; ridge regression is the more popular of the two methods.

Multicollinearity, or collinearity, is the existence of near-linear relationships among the independent variables. For example, suppose that the three ingredients of a mixture are studied by including their percentages of the total. These variables will have the (perfect) linear relationship P1 + P2 + P3 = 100. During regression calculations, this relationship causes a division by zero, which in turn aborts the calculations. When the relationship is not exact, the division by zero does not occur and the calculations are not aborted, but division by a very small quantity still distorts the results. Hence, one of the first steps in a regression analysis is to determine whether multicollinearity is a problem.
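To see the variance-reduction effect in practice, the sketch below fits ordinary least squares and ridge regression to deliberately collinear data. This is a minimal scikit-learn illustration, not the NCSS procedure the reference describes; the synthetic data, the true coefficients (3, 2), and the penalty strength alpha=1.0 are assumptions chosen for demonstration.

    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge

    rng = np.random.default_rng(0)
    n = 50
    x1 = rng.normal(size=n)
    x2 = x1 + rng.normal(scale=0.01, size=n)   # nearly collinear with x1
    X = np.column_stack([x1, x2])
    y = 3 * x1 + 2 * x2 + rng.normal(scale=0.5, size=n)

    ols = LinearRegression().fit(X, y)
    ridge = Ridge(alpha=1.0).fit(X, y)   # alpha controls the L2 penalty (the added bias)

    print("OLS coefficients:  ", ols.coef_)    # can swing far from (3, 2) due to collinearity
    print("Ridge coefficients:", ridge.coef_)  # shrunk, and much more stable across samples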

Ref.:

http://www.ncss.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Ridge_Regression.pdf

http://web.as.uky.edu/statistics/users/pbreheny/764-F11/notes/9-1.pdf

 

Least Absolute Shrinkage and Selection Operator (LASSO):

In statistics and machine learning, lasso (least absolute shrinkage and selection operator; also Lasso or LASSO) is a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the statistical model it produces. It was introduced by Robert Tibshirani in 1996, building on Leo Breiman's nonnegative garrote. Lasso was originally formulated for least squares models, and this simple case reveals a substantial amount about the behavior of the estimator, including its relationship to ridge regression and best subset selection, and the connections between lasso coefficient estimates and so-called soft thresholding. It also reveals that (as in standard linear regression) the coefficient estimates need not be unique if covariates are collinear.

Though originally defined for least squares, lasso regularization extends straightforwardly to a wide variety of statistical models, including generalized linear models, generalized estimating equations, proportional hazards models, and M-estimators. Lasso's ability to perform subset selection relies on the form of the constraint and has a variety of interpretations, including in terms of geometry, Bayesian statistics, and convex analysis.
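To make the extension to generalized linear models concrete, here is a minimal sketch of an L1-penalized (lasso-style) logistic regression in scikit-learn; the synthetic data and the choice C=0.1 (scikit-learn's inverse regularization strength) are illustrative assumptions, not values from the text.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # Synthetic data: 20 covariates, only 4 of which carry signal
    X, y = make_classification(n_samples=200, n_features=20, n_informative=4,
                               random_state=0)

    # L1 penalty applied to a generalized linear model (logistic regression);
    # smaller C means stronger regularization and more coefficients at exactly zero
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)

    print("nonzero coefficients:", int(np.sum(clf.coef_ != 0)))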

 Lasso was originally introduced in the context of least squares, and it can be instructive to consider this case first, since it illustrates many of lasso’s properties in a straightforward setting.

Consider a sample consisting of $n$ cases, each of which consists of $p$ covariates and a single outcome. Let $y_i$ be the outcome and $x_i := (x_{i1}, x_{i2}, \ldots, x_{ip})^T$ be the covariate vector for the $i$th case. Then the objective of lasso is to solve

$$\min_{\beta_0, \beta} \left\{ \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^T \beta \right)^2 \right\} \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t.$$

Here $t$ is a prespecified free parameter that determines the amount of regularisation. Letting $X$ be the covariate matrix, so that $X_{ij} = (x_i)_j$ and $x_i^T$ is the $i$th row of $X$, we can write this more compactly as

$$\min_{\beta_0, \beta} \left\{ \frac{1}{n} \left\| y - \beta_0 \mathbf{1} - X \beta \right\|_2^2 \right\} \quad \text{subject to} \quad \|\beta\|_1 \le t,$$

where $\|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|$ is the standard $\ell_1$ norm.

Since $\hat{\beta}_0 = \bar{y} - \bar{x}^T \beta$, it is standard to work with centered variables. Additionally, the covariates are typically standardized $\left( \sum_{i=1}^{n} x_{ij}^2 = n \right)$ so that the solution does not depend on the measurement scale.

It can be helpful to rewrite the problem in the so-called Lagrangian form

$$\min_{\beta \in \mathbb{R}^p} \left\{ \frac{1}{n} \| y - X \beta \|_2^2 + \lambda \|\beta\|_1 \right\},$$

where the exact relationship between $t$ and $\lambda$ is data dependent.
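The Lagrangian form is what scikit-learn's Lasso solves, up to a constant rescaling of $\lambda$: it minimizes $\frac{1}{2n} \| y - X\beta \|_2^2 + \alpha \|\beta\|_1$, with $\lambda$ passed as alpha. A minimal sketch, where the synthetic data and alpha=0.1 are illustrative assumptions:

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(1)
    n, p = 100, 10
    X = rng.normal(size=(n, p))
    X = (X - X.mean(axis=0)) / X.std(axis=0)       # center and standardize the covariates
    beta_true = np.array([3.0, -2.0] + [0.0] * (p - 2))
    y = X @ beta_true + rng.normal(scale=0.5, size=n)

    lasso = Lasso(alpha=0.1).fit(X, y)             # alpha plays the role of lambda
    print(lasso.coef_)                             # most coefficients driven exactly to zero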

Ref.: http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/20060011038.pdf

Elastic Net:

In statistics and, in particular, in the fitting of linear or logistic regression models, the elastic net is a regularized regression method that linearly combines the L1 and L2 penalties of the lasso and ridge methods.

Specification

The elastic net method overcomes the limitations of the LASSO (least absolute shrinkage and selection operator) method, which uses a penalty function based on

$$\|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|.$$

Use of this penalty function has several limitations. For example, in the "large p, small n" case (high-dimensional data with few examples), the LASSO selects at most n variables before it saturates. Also, if there is a group of highly correlated variables, the LASSO tends to select one variable from the group and ignore the others. To overcome these limitations, the elastic net adds a quadratic part to the penalty ($\|\beta\|^2$), which when used alone is ridge regression (known also as Tikhonov regularization). The estimates from the elastic net method are defined by

$$\hat{\beta} = \underset{\beta}{\operatorname{argmin}} \left( \| y - X\beta \|^2 + \lambda_2 \|\beta\|^2 + \lambda_1 \|\beta\|_1 \right).$$

The quadratic penalty term makes the loss function strictly convex, and it therefore has a unique minimum. The elastic net method includes the LASSO and ridge regression: each of them is a special case, with $\lambda_1 = \lambda, \lambda_2 = 0$ (LASSO) or $\lambda_1 = 0, \lambda_2 = \lambda$ (ridge regression). Meanwhile, the naive version of the elastic net method finds an estimator in a two-stage procedure: first, for each fixed $\lambda_2$, it finds the ridge regression coefficients, and then performs a LASSO-type shrinkage. This kind of estimation incurs a double amount of shrinkage, which leads to increased bias and poor predictions. To improve the prediction performance, the authors rescale the coefficients of the naive version of the elastic net by multiplying the estimated coefficients by $(1 + \lambda_2)$.
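A minimal sketch of the combined penalty using scikit-learn's ElasticNet. Note the assumptions: scikit-learn parameterizes the penalty as $\alpha \left( \text{l1\_ratio} \cdot \|\beta\|_1 + \tfrac{1}{2}(1 - \text{l1\_ratio}) \|\beta\|_2^2 \right)$ rather than with separate $\lambda_1, \lambda_2$, it does not apply the $(1 + \lambda_2)$ rescaling from the original paper, and the "large p, small n" data below are fabricated for illustration.

    import numpy as np
    from sklearn.linear_model import ElasticNet

    rng = np.random.default_rng(2)
    n, p = 30, 100                                      # "large p, small n"
    X = rng.normal(size=(n, p))
    X[:, 1] = X[:, 0] + rng.normal(scale=0.01, size=n)  # a pair of highly correlated covariates
    y = X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=n)

    # l1_ratio=1 recovers the LASSO; l1_ratio=0 recovers ridge regression
    enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
    print("coefficients on the correlated pair:", enet.coef_[:2])  # both tend to be retained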

Examples of where the elastic net method has been applied are:

  • Support vector machine

  • Metric learning

  • Portfolio risk management

Ref.: http://web.stanford.edu/~hastie/TALKS/enet_talk.pdf

Least Angle Regression:

In statistics, least-angle regression (LARS) is an algorithm for fitting linear regression models to high-dimensional data, developed by Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani.

Suppose we expect a response variable to be determined by a linear combination of a subset of potential covariates. Then the LARS algorithm provides a means of producing an estimate of which variables to include, as well as their coefficients.

Instead of giving a vector result, the LARS solution consists of a curve denoting the solution for each value of the L1 norm of the parameter vector. The algorithm is similar to forward stepwise regression, but instead of including variables at each step, the estimated parameters are increased in a direction equiangular to each one's correlations with the residual.
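The sketch below traces that piecewise linear path on the diabetes dataset (the data used in the Efron et al. paper) with scikit-learn's lars_path; this is an illustration only, with method="lar" selecting plain LARS rather than the lasso variant of the algorithm.

    import numpy as np
    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import lars_path

    X, y = load_diabetes(return_X_y=True)

    # alphas are the breakpoints of the path; coefs[:, k] gives the
    # coefficient vector at the k-th step, and `active` lists the order
    # in which variables entered the model
    alphas, active, coefs = lars_path(X, y, method="lar")

    print("variables entered in order:", active)
    print("path shape (p, n_steps):", coefs.shape)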

The advantages of the LARS method are:

  1. It is computationally just as fast as forward selection.
  2. It produces a full piecewise linear solution path, which is useful in cross-validation or similar attempts to tune the model.
  3. If two variables are almost equally correlated with the response, then their coefficients should increase at approximately the same rate. The algorithm thus behaves as intuition would expect, and also is more stable.
  4. It is easily modified to produce solutions for other estimators, like the lasso.
  5. It is effective in contexts where p >> n (i.e., when the number of dimensions is significantly greater than the number of points).

The disadvantages of the LARS method include:

  1. With any amount of noise in the dependent variable and with high-dimensional, multicollinear independent variables, there is no reason to believe that the selected variables will have a high probability of being the actual underlying causal variables. This problem is not unique to LARS; it is a general problem with variable selection approaches that seek to find underlying deterministic components. Yet, because LARS is based upon an iterative refitting of the residuals, it would appear to be especially sensitive to the effects of noise. This problem is discussed in detail by Weisberg in the discussion section of the Efron et al. (2004) Annals of Statistics article.[2] Weisberg provides an empirical example, based upon re-analysis of data originally used to validate LARS, showing that the variable selection appears to have problems with highly correlated variables.
  2. Since almost all high dimensional data in the real world will just by chance exhibit some fair degree of collinearity across at least some variables, the problem that LARS has with correlated variables may limit its application to high dimensional data.

Ref.: http://statweb.stanford.edu/~tibs/ftp/lars.pdf
