Machine Learning Algorithm - Dimensionality Reduction (Part 4 of 12)
Principal Component Analysis :
Principal components analysis is a procedure for identifying a smaller number of uncorrelated variables, called "principal components", from a large set of data. The goal of principal components analysis is to explain the maximum amount of variance with the fewest number of principal components. Principal components analysis is commonly used in the social sciences, market research, and other industries that use large data sets.
Principal components analysis is commonly used as one step in a series of analyses. You can use principal components analysis to reduce the number of variables and avoid multicollinearity, or when you have too many predictors relative to the number of observations.
Example
A consumer products company wants to analyze customer responses to several characteristics of a new shampoo: color, smell, texture, cleanliness, shine, volume, amount needed to lather, and price. They perform a principal components analysis to determine whether they can form a smaller number of uncorrelated variables that are easier to interpret and analyze. The results identify the following patterns:
- Color, smell, and texture form a "Shampoo quality" component.
- Cleanliness, shine, and volume form an "Effect on hair" component.
- Amount needed to lather and price form a "Value" component.
Ref : http://setosa.io/ev/principal-component-analysis/
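To make this concrete, here is a minimal sketch of PCA in Python with scikit-learn. The ratings matrix is synthetic and merely stands in for survey data like the shampoo example; the shapes and component count are illustrative assumptions, not part of the original study.

```python
# Minimal PCA sketch with scikit-learn; the synthetic matrix stands in
# for survey data such as the shampoo ratings described above.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))              # 200 respondents x 8 characteristics

X_std = StandardScaler().fit_transform(X)  # PCA is sensitive to variable scale
pca = PCA(n_components=3)
Z = pca.fit_transform(X_std)               # scores on the first 3 components

print(pca.explained_variance_ratio_)       # variance explained per component
print(pca.components_)                     # loadings: each row is a component
```

Standardizing first is the usual choice when the variables are on different scales; otherwise the components simply chase the highest-variance raw variables.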
Principal Components Regression (PCR) :
In principal components regression, we first perform principal components analysis (PCA) on the original data, then perform dimension reduction by selecting the number of principal components (m) using cross-validation or test set error, and finally conduct regression using the first m dimension reduced principal components.
Principal components regression forms the derived input columns z_m = Xv_m, where v_m is the m-th principal component direction, and then regresses y on z_1, z_2, ..., z_m for some m ≤ p.
Principal components regression discards the p − m components with the smallest eigenvalues.
By setting the projections onto the principal component directions with small eigenvalues to zero (i.e., keeping only the large ones), dimension reduction is achieved. PCR is very similar to ridge regression in a certain sense. Ridge regression can be viewed conceptually as projecting the y vector onto the principal component directions and then shrinking the projection on each direction, where the amount of shrinkage depends on the variance of that principal component. Ridge regression shrinks everything, but it never shrinks anything exactly to zero. By contrast, PCR either does not shrink a component at all or shrinks it to zero.
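A minimal sketch of this pipeline, assuming scikit-learn and synthetic data: PCA produces the derived inputs, least squares is run on the first m components, and m is chosen by cross-validation as described above.

```python
# Sketch of principal components regression: PCA for dimension reduction,
# then least squares on the first m components, with m chosen by CV.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                       # synthetic predictors
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

pcr = Pipeline([("pca", PCA()), ("ols", LinearRegression())])
search = GridSearchCV(pcr, {"pca__n_components": range(1, 11)}, cv=5)
search.fit(X, y)
print(search.best_params_)                           # the m minimizing CV error
```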
Partial Least Squares Regression :
Partial least squares (PLS) regression is a technique that reduces the predictors to a smaller set of uncorrelated components and performs least squares regression on these components, instead of on the original data. PLS regression is especially useful when your predictors are highly collinear, or when you have more predictors than observations and ordinary least-squares regression either produces coefficients with high standard errors or fails completely.
PLS regression is primarily used in the chemical, drug, food, and plastic industries. A common application is to model the relationship between spectral measurements (NIR, IR, UV), which include many variables that are often correlated with each other, and chemical composition or other physico-chemical properties. In PLS regression, the emphasis is on developing predictive models. Therefore, it is not usually used to screen out variables that are not useful in explaining the response.
Minitab uses the nonlinear iterative partial least squares (NIPALS) algorithm developed by Herman Wold. The algorithm reduces the number of predictors using a technique similar to principal components analysis to extract a set of components that describes maximum correlation between the predictors and response variables. If the predictors are highly correlated, or if a smaller number of components perfectly model the response, then the number of components in the model might be much less than the number of predictors. Minitab then performs least-squares regression on these uncorrelated components. Also, cross-validation is often used to select the components that maximize the model's predictive ability.
PLS regression fits multiple response variables in a single model. Because PLS regression models the response variables in a multivariate way, the results can differ significantly from those calculated for the response variables individually. You should model multiple responses in a single PLS regression model only when they are correlated.
Ref : http://www.ats.ucla.edu/stat/sas/library/pls.pdf
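As a sketch of the workflow described above, the example below fits PLS regression on synthetic wide data with more predictors than observations and picks the number of components by cross-validation. It assumes scikit-learn, whose PLSRegression is also NIPALS-based; the data and sizes are illustrative, not a real spectral dataset.

```python
# Sketch of PLS regression on wide, collinear data (p > n), the setting
# where PLS is most useful; component count chosen by cross-validated R^2.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 200))        # n = 40 samples, p = 200 "wavelengths"
y = X[:, :5] @ np.ones(5) + rng.normal(scale=0.1, size=40)

for k in (1, 2, 5, 10):
    pls = PLSRegression(n_components=k)
    score = cross_val_score(pls, X, y, cv=5).mean()  # mean R^2 over folds
    print(k, round(score, 3))
```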
Sammon Mapping :
Sammon mapping or Sammon projection is an algorithm that maps a high-dimensional space to a space of lower dimensionality by trying to preserve the structure of inter-point distances of the high-dimensional space in the lower-dimensional projection. It is particularly suited for use in exploratory data analysis. The method was proposed by John W. Sammon in 1969. It is considered a non-linear approach, since the mapping cannot be represented as a linear combination of the original variables as is possible in techniques such as principal component analysis, which also makes it more difficult to use for classification applications.
Denote the distance between the i-th and j-th objects in the original space by d*_{ij}, and the distance between their projections by d_{ij}. Sammon's mapping aims to minimize the following error function, which is often referred to as Sammon's stress or Sammon's error:

E = (1 / Σ_{i<j} d*_{ij}) · Σ_{i<j} (d*_{ij} − d_{ij})² / d*_{ij}
The minimization can be performed either by gradient descent, as proposed initially, or by other means, usually involving iterative methods. The number of iterations needs to be determined experimentally, and convergent solutions are not always guaranteed. Many implementations prefer to use the first principal components as a starting configuration.
The Sammon mapping has been one of the most successful nonlinear metric multidimensional scaling methods since its advent in 1969, but effort has focused on algorithm improvement rather than on the form of the stress function. Its performance has been improved by extending the stress function using left and right Bregman divergences.
Ref : https://www.cse.msu.edu/~cse802/Papers/sammon.pdf
Ref : http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/AV0910/henderson.pdf
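The following sketch minimizes the stress function above directly with a general-purpose optimizer, starting from the PCA projection as noted earlier. It assumes NumPy, SciPy, scikit-learn, and synthetic data; it is illustrative only, since practical implementations typically use the specialized gradient scheme from Sammon's paper.

```python
# Minimal Sammon mapping sketch: minimize Sammon's stress with a generic
# optimizer, initialized from the PCA projection.
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import pdist
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
d_star = pdist(X)                      # pairwise distances in the original space
c = d_star.sum()

def sammon_stress(flat_y, n=50, dim=2):
    d = pdist(flat_y.reshape(n, dim))  # distances between the projected points
    return np.sum((d_star - d) ** 2 / d_star) / c

y0 = PCA(n_components=2).fit_transform(X).ravel()    # PCA starting configuration
res = minimize(sammon_stress, y0, method="L-BFGS-B")
Y = res.x.reshape(50, 2)               # the 2-D Sammon projection
print(res.fun)                         # final stress value
```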
Multidimensional Scaling :
Multidimensional scaling (MDS) is a means of visualizing the level of similarity of individual cases of a dataset. It refers to a set of related ordination techniques used in information visualization, in particular to display the information contained in a distance matrix. An MDS algorithm aims to place each object in N-dimensional space such that the between-object distances are preserved as well as possible. Each object is then assigned coordinates in each of the N dimensions. The number of dimensions of an MDS plot N can exceed 2 and is specified a priori. Choosing N=2 optimizes the object locations for a two-dimensional scatterplot.
1. If the input data are dissimilarities, the function is never decreasing. If the input data are similarities, the function is never increasing.
2. In some cases, however, it is better to rerun the data collection on the subset of items. This is because the presence of the other items can evoke additional dimensions/attributes of comparison that could affect the way items in the subset are viewed.
3. Fortunately, a very simple technique exists to deal with this problem. The technique is called property fitting (PROFIT).
For examples, you can refer to:
https://www.utdallas.edu/~herve/Abdi-MDS2007-pretty.pdf
http://forrest.psych.unc.edu/teaching/p208a/mds/mds.html
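For a quick experiment, the sketch below runs metric MDS on a precomputed distance matrix with scikit-learn, choosing N = 2 for a scatterplot as discussed above; the data are synthetic and illustrative.

```python
# Sketch of metric MDS on a precomputed distance matrix; the embedding
# places points so that pairwise distances are preserved as well as possible.
import numpy as np
from sklearn.manifold import MDS
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 6))
D = squareform(pdist(X))               # 30 x 30 symmetric distance matrix

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)          # 2-D coordinates for plotting
print(mds.stress_)                     # residual stress of the embedding
```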
Projection Pursuit :
Friedman and Tukey (1974) introduced the term "projection pursuit" for a technique for the exploratory analysis of multivariate data sets; the method seeks out "interesting" linear projections of the multivariate data onto a line or a plane. Later work set Friedman and Tukey's idea in a more structured context, suggesting reformulations of the method, identifying computationally efficient approaches to its implementation, and examining its practical attractions and limitations. Extensions by other workers to problems such as non-linear multiple regression and multivariate density estimation fit within the same framework.
The model consists of linear combinations of non-linear transformations of linear combinations of explanatory variables. The basic model takes the form:

Y = Σ_{j=1}^{r} f_j(β_j' x) + ε

where x is a column vector containing a particular row of the design matrix X, which contains p explanatory variables (columns) and n observations (rows). Here Y is the response to be predicted for that row, the β_j are r unit vectors of length p containing the unknown parameters, and r is the number of smoothed non-parametric functions f_j to be used as constructed explanatory variables. The value of r is found through cross-validation or a forward stage-wise strategy that stops when the model fit cannot be significantly improved. For large values of r and an appropriate set of functions f_j, the PPR model is considered a universal estimator, as it can approximate any continuous function on R^p.
Thus this model takes the form of the basic additive model but with the additional β_j component, so that it fits β_j'x rather than the actual inputs x. The vector β_j'X is the projection of X onto the unit vector β_j, where the directions β_j are chosen to optimize model fit. The functions f_j are unspecified by the model and are estimated using some flexible smoothing method, preferably one with well-defined second derivatives to simplify computation. This allows PPR to be very general, since it fits non-linear functions f_j of any class of linear combinations in X. Due to the flexibility and generality of this model, the fitted model is difficult to interpret, because each input variable enters the model in a complex and multifaceted way. Thus the model is far more useful for prediction than for understanding the data.
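A minimal single-term (r = 1) sketch of this idea follows. It assumes a cubic polynomial as a stand-in for the unspecified smoother f and synthetic data; it illustrates the alternating "fit f for a candidate β, then improve β" structure, not Friedman and Tukey's original algorithm.

```python
# Single-term projection pursuit regression sketch: find a unit direction
# beta and a ridge function f so that f(beta' x) fits y.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
beta_true = np.array([0.8, 0.6, 0.0])                # a unit vector
y = np.sin(X @ beta_true) + rng.normal(scale=0.1, size=200)

def ridge_rss(beta):
    beta = beta / (np.linalg.norm(beta) + 1e-12)     # keep beta a unit vector
    t = X @ beta                                     # project rows onto beta
    f = np.poly1d(np.polyfit(t, y, deg=3))           # cubic stand-in for f
    return np.sum((y - f(t)) ** 2)                   # residual sum of squares

res = minimize(ridge_rss, x0=np.ones(3), method="Nelder-Mead")
print(res.x / np.linalg.norm(res.x))                 # ~beta_true (up to sign)
```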
Advantages of PPR estimation
- It uses univariate regression functions instead of their multivariate form, thus effectively dealing with the curse of dimensionality
- Univariate regression allows for simple and efficient estimation
- Relative to generalized additive models, PPR can estimate a much richer class of functions
- Unlike local averaging methods (such as k-nearest neighbors), PPR can ignore variables with low explanatory power.
Disadvantages of PPR estimation:
- PPR requires examining an M-dimensional parameter space in order to estimate β_j.
- One must select the smoothing parameter for each f_j.
- The model is often difficult to interpret.
Discriminant Analysis (LDA, MDA, QDA, FDA) :
Linear and quadratic discriminant analysis are considered in the small-sample, high-dimensional setting. Alternatives to the usual maximum likelihood (plug-in) estimates for the covariance matrices are proposed. These alternatives are characterized by two parameters, the values of which are customized to individual situations by jointly minimizing a sample-based estimate of future misclassification risk. Computationally fast implementations are presented, and the efficacy of the approach is examined through simulation studies and application to data. These studies indicate that in many circumstances dramatic gains in classification accuracy can be achieved.
Multiple discriminant analysis (MDA) is a statistical technique used to classify an observation into one of several a priori groupings based on the observation's individual characteristics. It is used primarily to classify and/or make predictions in problems where the dependent variable is qualitative, e.g., male or female, bankrupt or non-bankrupt. The first step is therefore to establish explicit group classifications; the number of original groups can be two or more.
Fisher's linear discriminant analysis is a valuable tool for multi-group classification. With a large number of predictors, one can find a reduced number of discriminant coordinate functions that are "optimal" for separating the groups. With two such functions, one can produce a classification map that partitions the reduced space into regions identified with group membership, with linear decision boundaries. Richer non-linear classification schemes are also possible: linear discriminant analysis is equivalent to multi-response linear regression using optimal scoring to represent the groups, and nonparametric versions of discriminant analysis (such as FDA) are obtained by replacing the linear regression with any nonparametric regression method. In this way, any multi-response regression technique (such as MARS or neural networks) can be post-processed to improve its classification performance.
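As a concrete illustration, the sketch below compares LDA (linear boundaries, shared covariance) and QDA (quadratic boundaries, per-class covariances) with scikit-learn on synthetic two-class data where the class covariances differ; the data and scores are illustrative assumptions, not results from the papers above.

```python
# Sketch comparing LDA and QDA on two classes with different covariances;
# QDA should have the advantage here because its boundary can curve.
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis)
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X0 = rng.normal(loc=0.0, scale=1.0, size=(100, 2))   # class 0
X1 = rng.normal(loc=1.5, scale=2.0, size=(100, 2))   # class 1: wider covariance
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

for model in (LinearDiscriminantAnalysis(), QuadraticDiscriminantAnalysis()):
    score = cross_val_score(model, X, y, cv=5).mean()  # CV accuracy
    print(type(model).__name__, round(score, 3))
```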