Dimensionality Reduction - PCA

Howdy Folks !! I have been longing to write an article on PCA for a long time, but I just wasn't getting the time. Today I resolved to write it anyway :) So here we go...

You must have heard the term PCA quite often if you are in the data science space. I have seen people using it casually in their models without understanding the underlying process. My objective with this blog is to demystify PCA and make its inner workings accessible to a larger audience.

Before getting into PCA, I would like to touch upon why it is used. For that we need to understand what multicollinearity and the curse of dimensionality are. I swear I am going to keep this as simple as possible: no fancy words, no cutting corners in explaining the logic. So hang in there with me.

Curse of Dimensionality !!

While you are dealing with a toy dataset, you encounter a limited number of features, and they are properly labeled. Let's take an example, say the Titanic dataset. You know the data is the passenger list of the Titanic, that you have a feature which tells you the gender, one for the fare, and so on. So two things: a limited number of features, and properly labeled. This is often not the case when you are dealing with real-world data. For instance, in medical research you may get data about thousands and thousands of genes, which enables you to analyse which medicine is suitable for which patient. Now there is no way you could identify which genes are more important and how many of them you should keep in scope for your research. In such situations, it is very probable that your final model is overfit and doesn't give you good results. Then what should you do? This is where PCA shines !!

Multicollinearity !!

Multicollinearity is a situation wherein many of your predictor variables are highly correlated. For instance, in an adult income dataset you may have features like total assets, total investments, total savings, and total current account balance. Think of a rich guy: he will rank high on all these attributes. Obviously, if he has a lot of money, he will in all likelihood have a high net worth, be heavily invested, have a hefty bank balance, and so on. In other words, these features are highly correlated among themselves. When we have such a situation, we may inadvertently give a lot of weightage to one aspect of the data, which may lead to an unstable model. Then what? You guessed it right... time to consider PCA !!


There are many conventional methods to reduce dimensions, for example:

Missing Values - If more than 90 percent of a feature's values are missing, you can write off that feature. Of course, before that you need to understand the reason behind the missing data.

Low Variance - If the distinct values of the feature are very tightly knit and the variance is very low, you are looking at pretty much a constant feature, and you can drop it from the model.

Random Forest/Decision Trees - You can build a Random Forest model and check the feature importance. This will give you an idea of which features are most important.

High Correlation - You do a pair plot of all your features and exclude those which correlate highly with other predictor features but poorly with the target variable.

Backward Elimination - This is a brute-force approach, where you create multiple models with permutations and combinations of the feature space and arrive at the model with the best performance measure. (A quick code sketch of a few of these conventional filters follows below.)
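To make these filters concrete, here is a minimal, hedged sketch in Python. It assumes a pandas DataFrame df with a column named target; the function name and the thresholds are my own choices for illustration, not a standard API.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def conventional_filters(df: pd.DataFrame, target: str):
    X, y = df.drop(columns=[target]), df[target]

    # 1. Missing values: flag features with more than 90% of values missing
    high_missing = X.columns[X.isna().mean() > 0.90].tolist()

    # 2. Low variance: flag near-constant numeric features
    num = X.select_dtypes("number")
    low_variance = num.columns[num.var() < 1e-3].tolist()

    # 3. Feature importance from a Random Forest (numeric features only here)
    rf = RandomForestRegressor(n_estimators=100, random_state=0)
    rf.fit(num.fillna(num.median()), y)
    importance = pd.Series(rf.feature_importances_, index=num.columns).sort_values(ascending=False)

    # 4. High correlation among the predictors themselves
    corr = num.corr().abs()

    return high_missing, low_variance, importance, corr
```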


For more complicated datasets, we need to rely more on the maths than on our own logic. These methods include:

  1. Principal Component Analysis (PCA)
  2. Linear Discriminant Analysis (LDA)
  3. Factor Analysis
  4. Singular Value Decomposition

Though all the above techniques lead to a common goal, the routes they traverse are very different. For now we will try to understand PCA. I will surely try to cover a few of the others in future blogs if time permits.

Let's take an intuitive approach to understand what we are trying to achieve with PCA.

[Image: two correlated features projected onto the x1 axis, the x2 axis, and the best-fit line]

Let's say we have two features which are highly correlated. In order to get rid of one dimension, what you can do is project the points onto various axes and find the one along which the projections have the highest variance. In the above snapshot, we project the points onto the x1 axis, the x2 axis, and the best-fit line (as if we were regressing one feature on the other). Clearly, the projections onto the principal component (the best-fit line) spread the points out the most, so that is our best candidate. This is essentially what PCA does: it takes all the features into consideration and finds the principal component axes that capture the most variance among those features. These axes are called eigenvectors, and the factors by which the vectors are stretched or squeezed are called eigenvalues.
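If you want to see this intuition in code, here is a small NumPy sketch on entirely synthetic data: it generates two correlated features, projects them onto the x1 axis, the x2 axis, and the leading eigenvector of the covariance matrix, and compares the variance of each projection.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 0.9 * x1 + 0.3 * rng.normal(size=500)    # highly correlated with x1
X = np.column_stack([x1, x2])
X = X - X.mean(axis=0)                        # centre the data

# Candidate directions: the x1 axis, the x2 axis, and the leading eigenvector
directions = {
    "x1 axis": np.array([1.0, 0.0]),
    "x2 axis": np.array([0.0, 1.0]),
}
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
directions["principal component"] = eigvecs[:, np.argmax(eigvals)]

for name, d in directions.items():
    proj = X @ (d / np.linalg.norm(d))        # 1-D projection onto this direction
    print(f"{name:20s} variance of projection = {proj.var():.3f}")
```

The principal-component direction wins: its projection has the highest variance, which is exactly what PCA exploits.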

Steps to perform PCA !!
  1. Identify the use case for performing PCA.
  2. Standardise your feature space.
  3. Create the covariance matrix.
  4. Calculate the eigenvectors and eigenvalues.
  5. Pick the principal components using a set of rules.
  6. Proceed with the model building.

What I am going to do is take a dataset and show you exactly how each step is performed. I have taken a California house price dataset and applied PCA to it. We have the below features as predictors, and we are trying to predict the median house price in a locality.

[Image: preview of the California housing dataset]

We have the longitude and latitude of the locality, the median age of the houses, total rooms, total bedrooms, population of the locality, households, median income of the locality's population, proximity to the ocean, and our target variable, median house value.

One important point to note: PCA doesn't consider the target variable in the analysis, since we look for variance among the predictor features only. So you need to exclude the target variable before processing.


Step 1: To identify the applicability of PCA, you need to check whether you have multicollinearity in your predictor variables. You can check that with various methods like a pair plot, a heat map, or VIF. Here is a quick look through them...

[Image: pair plot of the predictor variables]

Looking at the above scatter plot, we identify a strong correlation between total_rooms, total_bedrooms, population, and households, which logically makes sense too. Let's look at the heat map to check the Pearson correlation coefficients.

[Image: correlation heat map of the predictors]

Looking at the heat map, we are pretty sure that there is high correlation between rooms, bedrooms, population, etc. We also see a high negative correlation between longitude and latitude, the correlation coefficient being in the neighbourhood of -0.90. Let's look at the VIF numbers for the final decision.

[Image: VIF values for each predictor]

Features with a VIF greater than 10 are highly correlated with other features. total_rooms, total_bedrooms, and households are outright PCA candidates. Longitude and latitude are also cutting close to the boundary, so we are now confident that we need to go for PCA.
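For reference, here is a minimal sketch of the VIF check using statsmodels. It assumes the numeric predictors sit in a DataFrame X; the helper name vif_table is mine.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X: pd.DataFrame) -> pd.Series:
    """VIF for each numeric predictor; values above ~10 signal strong multicollinearity."""
    Xc = sm.add_constant(X)                    # include an intercept column
    vifs = {col: variance_inflation_factor(Xc.values, i)
            for i, col in enumerate(Xc.columns) if col != "const"}
    return pd.Series(vifs).sort_values(ascending=False)

# Usage (X holds the numeric predictors, e.g. total_rooms, total_bedrooms, ...):
# print(vif_table(X))
```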

Step 2: Since we will be dealing with the covariance matrix to figure out which features vary together, the first thing we need to do is standardise the feature space. This is because covariance is not scaled (in contrast to the correlation matrix, which is). Standardising is a simple process wherein we subtract the mean of each feature from every value and divide by that feature's standard deviation, arriving at a new value which tells us how many standard deviations away that point is from the mean.

[Image: feature values before and after standardisation]

You can see above how the values change in the standardised feature space.
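A quick sketch of this step, both by hand and with scikit-learn's StandardScaler. The two-column DataFrame below is only a made-up stand-in for the nine housing predictors.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Illustrative stand-in for the predictors (in the article, X holds all nine features)
X = pd.DataFrame({"total_rooms": [880.0, 7099.0, 1467.0],
                  "median_income": [8.33, 8.30, 7.26]})

# By hand: z = (x - mean) / standard deviation
X_z_manual = (X - X.mean()) / X.std(ddof=0)

# Equivalent with scikit-learn (it also divides by the population standard deviation)
X_z = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns, index=X.index)
print(X_z.round(3))
```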

Step 3: Creating the covariance matrix. First, let's understand what covariance is. To give you perspective on how covariance is calculated, below are the formulas.

Let's first look at variance:

Var(X) = Σ (xᵢ − x̄)² / (n − 1)

I am sure you already know about variance; above is the formula for it. Basically, variance tells you how your data is spread around the mean. Now let's see covariance.

Cov(X, Y) = Σ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1)

While variance is used for univariate analysis, covariance is used for bivariate analysis. Covariance tells us how two features vary with each other.


We have 9 features here, so the covariance matrix will be a 9 x 9 matrix. Basically, we take each pair of predictor variables, find the covariance for it, and then neatly stack all of these values in a table to get the matrix.

[Image: 9 x 9 covariance matrix of the standardised features]

As you see above, the diagonal gives us the variance of each feature, and the off-diagonal entries give us the covariance between each pair of variables. The matrix is symmetrical, since Cov(X, Y) is the same as Cov(Y, X). Now we have a numerical measure of how any two features in our feature space vary together. This will be our basis for identifying the eigenvalues and eigenvectors. Essentially, the PCA algorithm will look at all the features that vary together and find a merged vector which explains the maximum variance. I will explain that intuitively at the end as well, so follow along...
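Continuing from the standardised frame X_z in the earlier sketch, building the covariance matrix is one call to np.cov, or the formula written out by hand:

```python
import numpy as np
import pandas as pd

# Covariance matrix of the standardised features (9 x 9 in the article's example);
# for standardised data this is essentially the correlation matrix.
cov_matrix = pd.DataFrame(np.cov(X_z, rowvar=False),
                          index=X_z.columns, columns=X_z.columns)

# The same thing written out from the formula: sum of (x - x̄)(y - ȳ) over (n - 1)
centred = X_z - X_z.mean()
cov_manual = centred.T @ centred / (len(X_z) - 1)
```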

Step 4: Calculate the eigenvectors and eigenvalues. Let's first understand how these are calculated internally.

A v = λ v

The above equation forms the basis for finding the eigenvalues and eigenvectors. A, in our case, is the covariance matrix; lambda (λ) is an eigenvalue and v is the corresponding eigenvector. What the expression essentially says is that multiplying the matrix by the eigenvector gives the same result as scaling that eigenvector by some value, the eigenvalue. So finding the eigenvalues and eigenvectors comes down to solving this expression for λ and v.

(A − λI) v = 0

If you rearrange the original expression, you get the form above. Now all we need to do is equate the determinant of (A − λI) to 0 and solve: det(A − λI) = 0. That's it, we get our eigenvalues, and plugging each one back in gives the corresponding eigenvector. But don't worry, we don't have to calculate this manually; there are libraries that do it for us. Let's see how...

We can use either the numpy library (np.linalg.eig) or the sklearn library (sklearn.decomposition.PCA).

[Image: eigenvalues and eigenvectors computed with NumPy and with scikit-learn]
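A small sketch of both routes, continuing from cov_matrix and X_z defined in the earlier sketches:

```python
import numpy as np
from sklearn.decomposition import PCA

# Option 1: eigen-decomposition of the covariance matrix with NumPy
eigen_values, eigen_vectors = np.linalg.eig(cov_matrix.values)
order = np.argsort(eigen_values)[::-1]                      # sort by descending eigenvalue
eigen_values, eigen_vectors = eigen_values[order], eigen_vectors[:, order]

# Option 2: scikit-learn's PCA fitted on the standardised data
pca = PCA().fit(X_z)

print(eigen_values)                  # eigenvalues from NumPy
print(pca.explained_variance_)       # the same quantities from scikit-learn
# pca.components_ holds the eigenvectors (possibly with flipped signs, which is harmless)
```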

Both will give you the same results; you can use either of them. Now that we have the eigenvalues and eigenvectors, let's learn the rules for picking the principal components.

Step 5: Picking the principal components as per the rules.

Rule 1 - Eigenvalue Criteria

Please note: the sum of the eigenvalues equals the number of variables entered into the PCA (when the features are standardised). An eigenvalue of 1 therefore means that the component explains about one variable's worth of variability. With this logic, we select the principal components whose eigenvalues are greater than 1, as each of those explains at least one variable's worth of variability.

[Image: eigenvalues of the principal components]
Note: the principal components you get are in descending order of importance.
Also important to note, the sum of all the eigenvalues equals the number of features you are analysing !!
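In code, applying this eigenvalue (Kaiser) criterion to the eigen_values array from the earlier sketch is a couple of lines:

```python
import numpy as np

# Rule 1: keep the components whose eigenvalue exceeds 1
keep = np.where(eigen_values > 1)[0]
print("Components kept by the eigenvalue criterion:", [f"PC{i + 1}" for i in keep])

# Sanity check: the eigenvalues sum to the number of standardised features
print("Sum of eigenvalues:", eigen_values.sum(), "| number of features:", len(eigen_values))
```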

Results - Eigenvalue Criteria

So, as per the eigenvalue criteria, we will keep components PC1, PC2, and PC3. We will also keep PC4 and PC5 on standby, as their eigenvalues are very close to 1.

Rule 2 - Proportion of Variance Explained Criteria

This rule depends on the business requirement: how much of the variance you want your predictor variables to explain. Bear in mind that the initial PCs explain most of the variability and the later PCs explain progressively less, so we compute the cumulative sum of the explained-variance percentages to decide where to stop.

[Image: proportion and cumulative proportion of variance explained by each PC]
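Computing these percentages from the eigen_values array of the earlier sketch (equivalently, from pca.explained_variance_ratio_) looks like this:

```python
import numpy as np

# Rule 2: proportion of variance explained by each PC and its running (cumulative) total
explained_ratio = eigen_values / eigen_values.sum()   # same as pca.explained_variance_ratio_
cumulative = np.cumsum(explained_ratio)

for i, (r, c) in enumerate(zip(explained_ratio, cumulative), start=1):
    print(f"PC{i}: {r:6.1%} of variance, {c:6.1%} cumulative")
```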

Results - Proportion of Variance Explained Criteria

The first 4 PCs combined explain 88% of the variability, and taking 5 components explains 96.9%. If your business is fine with a simpler model that captures a bit less of the variance, you can take 4 components; if they need the model to retain more of the original information, take 5 PCs.

Rule 3 - Scree Plot Criteria

A scree plot is a graph of the eigenvalues against the principal components. The scree plot criterion says we should stop taking additional PCs once the curve flattens out. So let's plot it and see.

[Image: scree plot of eigenvalues against principal components]
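A matplotlib sketch of the scree plot, again using the eigen_values array from before:

```python
import matplotlib.pyplot as plt

# Rule 3: scree plot - eigenvalues against component number
components = range(1, len(eigen_values) + 1)
plt.plot(components, eigen_values, marker="o")
plt.axhline(y=1, linestyle="--", label="eigenvalue = 1 (Rule 1 cut-off)")
plt.xlabel("Principal component")
plt.ylabel("Eigenvalue")
plt.title("Scree plot")
plt.legend()
plt.show()
```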

Results - Scree Plot Criteria

If you look at the scree plot, the line y = 1 gives us the Rule 1 cut-off, which is the first 3 PCs. But looking carefully at the graph, the curve flattens out from PC6 onwards. So, as per Rule 3, we will take the first 5 PCs.

Rule 4 - Evaluating the Component Matrix

[Image: component matrix with PCs as columns and standardised features as rows]

Understanding and relating to the Component Matrix

The component matrix is a square matrix with the PCs as columns and the standardised features as rows. Each PC is a linear combination of the variables, and the values in the table give us the magnitude of each variable's contribution.

To identify which features do and do not have a higher say in each PC, we will consider only those features whose component contribution is greater than 0.5.
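One common way to build such a matrix is to scale each eigenvector by the square root of its eigenvalue (the "loadings"); I am assuming that convention here, continuing from the earlier eigen_vectors and eigen_values:

```python
import numpy as np
import pandas as pd

# Component (loading) matrix: eigenvectors scaled by the square roots of their eigenvalues,
# with the standardised features as rows and the principal components as columns.
loadings = pd.DataFrame(
    eigen_vectors * np.sqrt(eigen_values),
    index=X_z.columns,
    columns=[f"PC{i + 1}" for i in range(len(eigen_values))],
)

# Features whose absolute loading exceeds 0.5 are treated as associated with that PC
print(loadings.abs() > 0.5)
```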

PC1 - Associated Features - room_z, bedrm_z, popn_z, hhold_z - Selected

This is exactly as per our multicollinearity check. These 4 features were highly correlated, hence our PCA model has collated their common variance and made one PC out of it.

PC2 - Associated Features - long_z & lat_z - Selected

This was our second observation from the heat map and VIF: longitude and latitude were highly negatively correlated, so PCA has extracted the variance from these two features and created PC2.

PC3 - Associated Features - hage_z & minc_z - Selected

This PC has a strong variance contribution from median income and a smaller one from house age. I will take this PC as representing the variance from median income.

PC4 - Associated Features - hage_z and ocean_z - Selected

This principal component has more variance coming from ocean_proximity than from house age, so I will describe this PC as capturing the ocean_proximity variance.

PC5 - Associated Features - hage_z and ocean_z again - Not Selected

Since there wasn't any other feature contributing variance, the model again tried extracting variance from these two variables, so we can ignore this PC.

CONCLUSION - So, finally, considering all the rules and our business justification, we will go ahead with the first 4 PCs.

So, in the process, we reduced our 9 features to just 4. Reduced dimensions !!

A model built on these 4 PCs will be simpler and more stable than a model built on the original 9 correlated features. To sum up, by using PCA here we achieved the following (a short sketch of the final modelling step follows the list):

  1. Converted our model to a simpler model by reducing the dimensions.
  2. Got rid of multicollinearity.
  3. Still preserved roughly 90% of the explained variability.
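As a hedged sketch of that final modelling step (not the exact code behind this article): standardise the predictors, keep the first 4 PCs, and fit a simple linear regression, with X holding the numeric predictors and y the median house value (both placeholder names).

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardise the predictors, keep the first 4 principal components,
# then regress the target (median house value) on those components.
model = make_pipeline(StandardScaler(), PCA(n_components=4), LinearRegression())
scores = cross_val_score(model, X, y, cv=5, scoring="r2")   # X: predictors, y: target
print("Cross-validated R^2 with 4 PCs:", scores.mean())
```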

The only place where PCA hurts is the intuitive explanation to your clients, as they can relate to the original features far more easily than to principal components. You may have to put in more effort to help them understand the model.

Hope my article helped you understand the nitty-gritty of PCA. Happy Learning !!









