Understanding PCA with an example
What is Principal Component Analysis?
Principal component analysis (PCA) is a statistical procedure for describing a set of multivariate data of possibly correlated variables using a relatively small number of linearly uncorrelated variables.
The uncorrelated variables are created as linear combinations of the original variables, in decreasing order of importance, so that the first component explains most of the variation in the data.
Each component is orthogonal to the others, so it explains variation that is not already explained by the preceding components.
So in layman's terms, if a dataset is described by several different characteristics, PCA summarizes those characteristics into fewer ones that still explain the data well. Moreover, it removes the redundancy, or correlation, that can exist when a large number of variables describe the data, and produces a list of uncorrelated variables.
Background knowledge needed to understand PCA:
Variance: Variance measures the spread of the data in a dataset. In PCA, the variables are transformed in such a way that each successive component explains a decreasing share of the total variance.
Covariance: Covariance measures the strength of the relationship between two or more random variables. In PCA, the variables are transformed in such a way that the pairwise covariances of the resulting components are 0.
If a dataset is described by multiple variables, a variance-covariance matrix can be constructed from those variables, where the diagonal elements of the matrix are the variances of the individual variables and the off-diagonal elements are the pairwise covariances.
The purpose of PCA is to transform this matrix in such a way that all off-diagonal elements are 0. The resulting components are then uncorrelated, and each component explains a percentage of the total variance.
This transformation is achieved by the eigendecomposition of the variance-covariance matrix.
In PCA, the eigenvectors determine the directions of maximum variability and the eigenvalues specify the variance along each of those directions.
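As a quick illustration, here is a minimal sketch in base R of the eigendecomposition of a variance-covariance matrix, using a small made-up two-variable dataset:

X <- data.frame(x1 = c(2.5, 0.5, 2.2, 1.9, 3.1),
                x2 = c(2.4, 0.7, 2.9, 2.2, 3.0))
S <- cov(scale(X))   # variance-covariance matrix of the standardized data
e <- eigen(S)        # eigendecomposition
e$vectors            # eigenvectors: directions of maximum variability
e$values             # eigenvalues: variance along each direction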
Objective of PCA:
With a large number of variables, the variance-covariance matrix may be too large to study and interpret properly, as there would be too many pairwise correlations between the variables to consider. Graphical displays of the data may also be of little help when the dataset is very large.
To interpret the data in a more meaningful form, it is therefore necessary to reduce the number of variables to a few, interpretable linear combinations of the data. Each linear combination will correspond to a principal component.
PCA is mostly used as a tool in exploratory data analysis and for making predictive models.
How to construct principal components:
Step 1: From the dataset, standardize the variables so that all variables are on a single scale.
Step 2: Construct the variance-covariance matrix of those variables.
Step 3: Calculate the eigenvectors and eigenvalues of the covariance matrix. The eigenvectors represent the components of the dataset.
Step 4: Reorder the eigenvectors by their eigenvalues, highest to lowest. This gives the components in order of significance.
Step 5: Keep the top n components, which together explain 75%-80% of the variability of the dataset.
Step 6: Create a feature vector by taking the eigenvectors kept in Step 5 and forming a matrix with these eigenvectors in the columns.
Step 7: Take the transpose of the feature vector and multiply it on the left of the transposed original dataset. The values obtained are the principal component scores (a sketch of these steps follows this list).
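The steps above can be sketched in base R as follows; this is illustrative only, with X standing in for a hypothetical numeric dataset such as the small example above:

Z <- scale(X)                           # Step 1: standardize
S <- cov(Z)                             # Step 2: variance-covariance matrix
e <- eigen(S)                           # Steps 3-4: eigen() returns eigenvalues
                                        # already sorted highest to lowest
prop <- cumsum(e$values) / sum(e$values)
n <- which(prop >= 0.80)[1]             # Step 5: components explaining ~80%
W <- e$vectors[, 1:n, drop = FALSE]     # Step 6: feature vector (eigenvectors
                                        # as columns)
scores <- Z %*% W                       # Step 7: equivalent to t(t(W) %*% t(Z));
                                        # the principal component scores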
How many components to retain?
Since the objective of PCA is to retain as few dimensions as possible while preserving as much of the original information as possible, it is very important to find the optimal number of components to keep. Three methods are commonly used:
- Cumulative proportion of variance explained: In real-life data, the number of components that together explain 75%-80% of the variance is chosen as the number of components to use
- Eigenvalues greater than unity: In a correlation matrix the variance of each variable is one. A component with variance less than one is presumed to contain less information than a single variable and is discarded
- Scree plot of eigenvalues: Plot the eigenvalues against the component number; a natural breakpoint is one where the slope of the graph is steep to the left of the breakpoint and much flatter to the right (see the sketch below)
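The three criteria can be sketched in R as follows, assuming e holds the eigendecomposition from the earlier sketch:

cumsum(e$values) / sum(e$values)   # cumulative proportion of variance explained
sum(e$values > 1)                  # eigenvalues-greater-than-unity (Kaiser) rule
plot(e$values, type = "b",         # scree plot: look for the natural breakpoint
     xlab = "Component number", ylab = "Eigenvalue")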
Importance of standardization:
If the units of measurement of the different variables are not the same, then standardized data are preferable. Standardization removes the bias that can occur due to the use of different scales.
PCA achieves a higher level of dimension reduction when the variables in the dataset are highly correlated. The level of correlation can be tested using Bartlett's sphericity test, which tests the null hypothesis that the variables are uncorrelated (i.e., that the correlation matrix is an identity matrix).
Principal Component Regression:
Principal Components Regression is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates are unbiased, but their variances are large so they may be far from the true value. By adding a degree of bias to the regression estimates, principal components regression reduces the standard errors. It is hoped that the net effect will be to give more reliable estimates.
In PC regression, the original predictor variables are replaced by the uncorrelated principal components.
It follows the usual assumptions of ordinary least squares regression.
An example of PCA regression in R:
Problem Description: Predict the county-wise Democratic winner of the USA presidential primary election using the demographic information of each county.
Data Description: The dataset is obtained from Kaggle. It consists of county-wise demographic information for all 50 US states and presidential primary election results for 28 states.
More details about the dataset can be found in the following link:
https://www.kaggle.com/benhamner/2016-us-election
Exploratory Data Analysis:
There are 51 demographic variables for each county in the US. Based on some initial analysis, 33 were kept for further analysis.
From the correlation matrix below, pairwise correlations between the variables can be observed:
Bartlett's sphericity test was performed to determine whether the information provided by the initial variables can be summarized by a small number of factors. The psych package in R can be used to perform this test.
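A minimal sketch of the test, assuming df is the data frame of the 33 numeric demographic variables (the name is hypothetical):

library(psych)
M <- cor(df)                        # pairwise correlation matrix
cortest.bartlett(M, n = nrow(df))   # Bartlett's sphericity test; a small
                                    # p-value suggests the variables are
                                    # correlated enough to benefit from PCA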
Purpose of PCA in this dataset:
To predict the winner between Clinton and Sanders, logistic regression can be used by taking the dependent binary variable to be 1 if Clinton wins and 0 if Sanders wins.
The problem with this logistic regression can be multicollinearity. From the exploratory analysis above, pairwise correlation between variables is quite evident. Moreover, with this many variables it is difficult to manually decide which variables to keep.
Principal component analysis can be used in this situation to find out fewer uncorrelated components which can be further used in logistic regression as independent variables.
PCA procedure in R:
Before doing PCA, it is very important to standardize the variables to remove scaling bias. That can be done using the scale function in R.
Also, PCA can only be performed on numerical variables, so all categorical variables are removed from the dataset.
Principal component analysis is done in R using the princomp function, which is part of the stats package.
The syntax of the function is:
princomp(x, cor = FALSE, scores = TRUE, covmat = NULL)
where:
‘x’ is a numeric matrix or data frame which provides the data for the principal components analysis.
‘cor’ is a logical value indicating whether the calculation should use the correlation matrix or the covariance matrix. (The correlation matrix can only be used if there are no constant variables.)
‘scores’ is a logical value indicating whether the score on each principal component should be calculated.
‘covmat’ is a covariance matrix, or a covariance list as returned by cov.wt (and cov.mve or cov.mcd from package MASS). If supplied, this is used rather than the covariance matrix of x.
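Putting this together for the election data, a minimal sketch (df is again the hypothetical data frame of numeric demographic variables):

pca <- princomp(scale(df), scores = TRUE)   # PCA on the standardized variables
summary(pca)                                # per-component and cumulative
                                            # proportion of variance explained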
Summary Result:
Observing the summary result, 8 principal components were chosen, which together explained 80% of the variance of the dataset. (Look at the cumulative proportion row to get the total variance explained.)
The number of components to be used for further analysis can also be determined using the following scree plot (look for the elbow-shaped point):
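In R, the scree plot can be produced directly from the fitted object:

screeplot(pca, type = "lines", main = "Scree plot")   # look for the elbow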
The principal component scores corresponding to the first 8 components can be obtained from the $scores component of the fitted object.
These 8 principal component scores can be used as the predictor variables in a logistic regression model that predicts the binary outcome stating the winner.
The fit of the model can be tested using standard logistic regression diagnostics.
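A sketch of this final modelling step, where winner is a hypothetical 0/1 vector (1 if Clinton wins the county, 0 if Sanders wins):

pcs <- as.data.frame(pca$scores[, 1:8])   # first 8 principal component scores
model_df <- cbind(winner = winner, pcs)
fit <- glm(winner ~ ., data = model_df,   # logistic regression on the
           family = binomial)             # uncorrelated component scores
summary(fit)                              # standard diagnostics apply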
The entire code and output can be found in the following link: