Dimension Reduction Technique - Principal Component Analysis (PCA) using the "iris" data:
Before getting into the nitty-gritty of this article, i.e. PCA, I would like to briefly touch on what a "dimension reduction" technique means! As the name suggests, it is essentially the process of converting higher-dimensional data into lower-dimensional data in a way that conveys most of the information while using fewer, more informative features. These methods are usually brought to the forefront so that machine learning problems like regression and classification can work with better features.
One of the most widely used dimension reduction techniques is Principal Component Analysis, also known as "PCA". It is often treated as a black box: widely used but poorly understood. The mathematics can look a bit intimidating, but I have tried to make it accessible to non-math scholars. PCA does exactly what it says on the tin: it finds the principal components of the data. Put more plainly, it strips off the redundant parts of the data while keeping the vital components. Let's understand this idea with one of the most commonly used datasets - "iris".
Let's delve into the data!
First few observations of the data:
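A minimal R sketch of this step, assuming the built-in iris data set:

```r
data(iris)   # load the built-in iris data set
head(iris)   # display the first six observations
```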
Looking at the class type of all variables:
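In R, str() reports the class of every column, so a sketch of this step could be:

```r
str(iris)    # four numeric columns plus one factor column (Species)
```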
Looking at the total number of Species for each group:
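A one-line sketch using table():

```r
table(iris$Species)   # 50 observations for each of the three species
```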
Applying log to all the continuous variables so that we can get some pattern and make the data more interpretative.
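A sketch of the transformation, keeping the species labels aside (log.iris matches the name used later in this article; iris.species is an illustrative name):

```r
log.iris <- log(iris[, 1:4])   # log-transform the four continuous variables
iris.species <- iris[, 5]      # keep the Species factor aside
```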
Plotting these log-transformed variables, we get some really fantastic patterns:
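One plausible way to produce such a plot in R is a pairs plot of the log-transformed variables, colored by species (a sketch, not necessarily the original plotting code):

```r
pairs(log.iris, col = iris$Species,
      main = "Pairwise scatter plots of log-transformed iris variables")
```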
Here, the strongest correlation is found between Petal Width and Petal Length.
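We can check this numerically with the correlation matrix:

```r
cor(log.iris)   # Petal.Length and Petal.Width show the strongest pairwise correlation
```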
Scale the variables: It is necessary to normalize the data before performing PCA because the goal of this exercise, i.e. PCA, is to find the components that show maximum variance. If we do not normalize, variables measured on larger scales dominate and we might get a picture like the figure below:
From the figure it seems that a single component explains all the variance in the data, which is practically impossible. Thus, it becomes essential to scale the data and avoid such situations.
Here:
log.iris -> log-transformed values of all continuous variables, namely Sepal.Length, Sepal.Width, Petal.Length and Petal.Width.
Formula: scaled value = (log.iris - mean) / standard deviation
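R's built-in scale() function applies exactly this formula to every column (iris.scaled is an illustrative name):

```r
# subtract each column's mean and divide by its standard deviation
iris.scaled <- scale(log.iris, center = TRUE, scale = TRUE)
```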
Running Singular Value Decomposition (SVD):
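A sketch of this step with base R's svd() (the component names d, u and v are how R labels the result):

```r
s <- svd(iris.scaled)   # singular value decomposition: iris.scaled = U D V'
str(s)                  # a list with d (singular values), u and v (singular vectors)
```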
As seen above, after running SVD our original matrix X is decomposed into the product of three matrices - "U", "D" and "V" - such that X = U D V', where D is a diagonal matrix of singular values and the columns of V are the eigenvectors of the data's covariance matrix.
We have re-expressed our original data in terms of these eigenvectors. This reorients the data along the directions in which it has maximum variance.
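Since the variance along each principal direction is proportional to the squared singular value, the proportions can be computed as follows (var.explained is an illustrative name):

```r
var.explained <- s$d^2 / sum(s$d^2)   # proportion of variance per component
round(var.explained, 4)
```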
Plotting these proportions lets us see how much variance each component explains:
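A sketch of the scree plot:

```r
plot(var.explained, type = "b",
     xlab = "Principal component",
     ylab = "Proportion of variance explained")
```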
The scree plots shown above and below help us visualize the variance explained by each PC. We can decide the number of PCs we would like to keep based on these scree plots.
It would be interesting to see the cumulative variance explained by the first k principal components:
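A cumulative version of the same plot, via cumsum():

```r
cum.var <- cumsum(var.explained)   # cumulative variance of the first k PCs
plot(cum.var, type = "b",
     xlab = "Number of principal components (k)",
     ylab = "Cumulative proportion of variance explained")
```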
The number of principal components we'll retain depends on the specific application. For the PCA we performed here on the iris data set, two PCs appear to suffice for visualization purposes.
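To visualize the data in two dimensions, we can project the scaled data onto the first two right singular vectors (pc.scores is an illustrative name; the same scores are also given by the first two columns of s$u %*% diag(s$d)):

```r
pc.scores <- iris.scaled %*% s$v[, 1:2]   # project onto the first two PCs
plot(pc.scores, col = iris$Species,
     xlab = "PC1", ylab = "PC2",
     main = "iris data projected onto the first two principal components")
```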
Thus, PCA not only gives us the ability to visualize the data in a lower dimension but also reduces the computational time of some numerical algorithms.