PRINCIPAL COMPONENT ANALYSIS - Simplifying Data with PCA
Hi All!
I'm Saurav Kumar, a Master's student in Data Science at the University of Massachusetts and an experienced Machine Learning Engineer. In today's world of data-driven decision-making, it’s essential to work with datasets that are both manageable and insightful. That's where Principal Component Analysis (PCA) comes into play!
If you've been grappling with high-dimensional data and are looking for a way to simplify it while still retaining its most important features, PCA might just be the solution you've been searching for.
LET'S DIVE IN!
WHAT IS PCA?
Principal Component Analysis (PCA) is a dimensionality reduction technique. It helps transform your dataset by identifying the directions (principal components) where the data varies the most and projecting it onto a smaller subspace, without losing too much information. This allows us to reduce the number of variables (dimensions) while still capturing the most significant patterns in the data.
By doing this, PCA makes it easier to visualize complex datasets, speeds up algorithms, and can even help remove noise.
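To see the idea in miniature, here is a small sketch (the toy variables a and b are made up for illustration) of PCA compressing two correlated variables into one component:

# Two correlated variables collapse almost entirely onto one principal component
set.seed(42)
x <- rnorm(100)
toy <- data.frame(a = x, b = x + rnorm(100, sd = 0.3))
toy_pc <- prcomp(toy, center = TRUE, scale. = TRUE)
summary(toy_pc)              # PC1 should capture nearly all of the variance
scores_1d <- toy_pc$x[, 1]   # the data reduced to a single dimension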
WHY USE PCA?
Dimensionality reduction is often needed because:
- High-dimensional datasets are hard to visualize and interpret.
- Many features slow down algorithms and can hurt model performance.
- Correlated features introduce multicollinearity, which destabilizes many models.
- Some dimensions carry mostly noise, which PCA can help filter out.
LET'S PERFORM THE ACTIVITY AND UNDERSTAND THE IMPLEMENTATION
We will be using the most famous dataset in the field of Machine Learning, the IRIS DATASET, and the language will be R, which I consider to be the best for data analysis.
LOADING DATASET
data("iris")
str(iris)
summary(iris)
There are 5 variables: the first 4 are numeric, and the last is a factor variable with 3 different levels. Refer Fig-1
SPLITTING THE DATASET INTO TRAINING AND TESTING
set.seed(111)   # for reproducibility
# Randomly assign each row to group 1 (training, ~80%) or group 2 (testing, ~20%)
ind <- sample(2, nrow(iris),
              replace = TRUE,
              prob = c(0.8, 0.2))
training_data <- iris[ind == 1, ]
testing_data <- iris[ind == 2, ]
SCATTER PLOT AND CORRELATION COEFFICIENT
This plot helps us examine how the variables correlate with one another. We will use the training data for this analysis.
library(psych)
pairs.panels(training_data[,-5],
             gap = 0,
             bg = c("red", "yellow", "blue")[training_data$Species],
             pch = 21)
This plot displays scatter plots in the lower triangle and correlation coefficients in the upper triangle. The strongest correlation is between petal length and petal width; sepal length is also highly correlated with both petal length and petal width. The weakest correlation is between sepal length and sepal width. High correlation between independent variables leads to multicollinearity, which can make a model's predictions unstable, and this is exactly where PCA (Principal Component Analysis) becomes useful. Refer Fig-2
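To back up what the plot shows, the pairwise correlations can also be printed as a matrix; a minimal sketch using the same training data:

# Numeric view of the correlations displayed in the upper triangle of Fig-2
round(cor(training_data[,-5]), 2)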
PRINCIPAL COMPONENT ANALYSIS
To understand how our variables contribute to the underlying patterns in the data, we conducted a Principal Component Analysis (PCA). PCA helps reduce dimensionality by transforming the original variables into new uncorrelated variables, called principal components, while retaining most of the variability in the data.
Centering and Scaling: The variables were first centered (i.e., the mean of each variable was set to zero) and then scaled (standardized to have unit variance). This ensures that all variables contribute equally to the analysis, regardless of their original scale or units.
pc <- prcomp(training_data[,-5],
             center = TRUE,
             scale. = TRUE)
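As a sanity check on the centering and scaling step: PCA on standardized data is equivalent to an eigen-decomposition of the correlation matrix. A minimal sketch, assuming the pc object above:

# The variances of the PCs equal the eigenvalues of the correlation matrix
eig <- eigen(cor(training_data[,-5]))
eig$values   # should match the squared standard deviations below
pc$sdev^2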
Principal Components: The PCA produced four principal components (PCs), which are linear combinations of the original variables. The rotation matrix provides insight into how each variable contributes to the principal components. For example, in the first principal component (PC1), Sepal Length, Petal Length, and Petal Width all have positive loadings, while Sepal Width has a negative loading. This indicates that PC1 is driven mainly by Sepal Length and the petal dimensions. PC2, on the other hand, is primarily characterized by Sepal Width. Refer Fig-3
Proportion of Variance: The Proportion of Variance reveals that PC1 captures the largest portion of variability in the dataset, accounting for approximately 73% of the total variance. This suggests that much of the variability in the data can be explained by a single component. Meanwhile, PC3 and PC4 do not contribute significantly to the variability, meaning their impact on explaining the patterns is minimal. Refer Fig-4
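Both the loadings and the variance breakdown described above can be inspected directly from the pc object; a short sketch:

round(pc$rotation, 3)        # loadings: each column is one principal component
summary(pc)                  # standard deviation and proportion of variance per PC
pc$sdev^2 / sum(pc$sdev^2)   # the proportion of variance, computed by hand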
Implications: Since PC1 captures the majority of the variation, it simplifies the data by reducing the dimensions while preserving most of the relevant information. This is particularly useful for addressing multicollinearity and improving the stability of predictive models.
ORTHOGONALITY OF PRINCIPAL COMPONENTS
Orthogonality refers to the fact that each principal component (PC) in Principal Component Analysis (PCA) is perpendicular (or orthogonal) to the others. This means that the principal components are uncorrelated with each other, capturing distinct and independent directions of variance in the dataset.
pairs.panels(pc$x,
             gap = 0,
             bg = c("red", "yellow", "blue")[training_data$Species],
             pch = 21)
The scatter plot matrix (refer Fig-5) visually confirms this property: there are no discernible patterns between pairs of PCs, and the correlation coefficients are all essentially zero. This is significant because the orthogonal PCs address the multicollinearity present in the original dataset. By removing correlations between variables, PCA ensures that each principal component contributes uniquely to the analysis, reducing the risk of unstable or unreliable predictions in subsequent models.
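This can also be verified numerically; a minimal sketch, assuming the pc object above:

round(cor(pc$x), 3)                        # correlations between PC scores are (numerically) zero
round(t(pc$rotation) %*% pc$rotation, 3)   # the rotation matrix is orthonormal: this is the identity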
BI-PLOT
The bi-plot is a powerful visualization tool in Principal Component Analysis (PCA) that displays both the principal components and the contribution of the original variables in the same plot.
library(devtools)   # ggbiplot is typically installed from GitHub,
                    # e.g. devtools::install_github("vqv/ggbiplot")
library(ggbiplot)
g <- ggbiplot(pc,
              obs.scale = 1,
              var.scale = 1,
              groups = training_data$Species,
              ellipse = TRUE,
              circle = TRUE,
              ellipse.prob = 0.68)
g <- g + scale_color_discrete(name = '')
g <- g + theme(legend.direction = 'horizontal',
               legend.position = 'top')
print(g)
The bi-plot (refer Fig-6) shows PC1 on the x-axis and PC2 on the y-axis, with species groups color-coded and ellipses capturing 68% of the data points in each group. The arrows represent the original variables: arrows pointing in similar directions indicate highly correlated variables. Variables on the right are positively correlated with PC1, and those on the left are negatively correlated with it. This aligns with the earlier findings from the rotation matrix.
PREDICTION WITH PRINCIPAL COMPONENTS
Here we transform the training and testing datasets into the principal component space while retaining their respective labels. This step is crucial for applying machine learning models to the reduced feature set (principal components).
# Project both datasets onto the principal components learned from training
trg <- predict(pc, training_data)
trg <- data.frame(trg, training_data[5])   # reattach the Species labels
print(trg)
tst <- predict(pc, testing_data)
tst <- data.frame(tst, testing_data[5])
print(tst)
The transformed test scores are shown in Fig-7.
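Note that predict() reuses the centering, scaling, and rotation learned from the training data rather than re-fitting on the test set. A small sketch verifying this by hand, assuming the objects above:

# predict(pc, newdata) standardizes with the training parameters and rotates
manual <- scale(testing_data[,-5],
                center = pc$center,
                scale = pc$scale) %*% pc$rotation
all.equal(unname(manual), unname(predict(pc, testing_data)))   # should be TRUE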
MULTINOMIAL LOGISTIC REGRESSION WITH 2 PCs
In this section, we apply Multinomial Logistic Regression using the principal components (PC1 and PC2) to predict the species of iris flowers. By first releveling the Species variable with "setosa" as the reference, we ensure that all other species are compared to it. Using only PC1 and PC2 as predictors is effective because these two components capture the majority of the data's variance, simplifying the model while still accounting for key patterns.
library(nnet)
trg$Species <- relevel(trg$Species, ref = "setosa")
mymodel <- multinom(Species~PC1+PC2, data=trg)
summary(mymodel)
This helps mitigate multicollinearity issues and allows for a clearer interpretation of how the principal components relate to species classification. The result is a more robust model that leverages the reduced dimensions from PCA.
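Note that summary(mymodel) reports coefficients and standard errors but not p-values. A common follow-up, sketched here, is to compute Wald z-statistics and two-tailed p-values from those quantities:

# Wald z-statistics and two-tailed p-values for the multinom coefficients
z <- summary(mymodel)$coefficients / summary(mymodel)$standard.errors
p <- (1 - pnorm(abs(z))) * 2
round(p, 4)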
CONFUSION MATRIX AND MISCLASSIFICATION ERROR ON TRAINING DATA
p <- predict(mymodel, trg)
tab <- table(p, trg$Species)
tab
# Calculating misclassification error
1 - sum(diag(tab)) / sum(tab)
In this section, we evaluate the performance of our Multinomial Logistic Regression model by constructing a confusion matrix and calculating the misclassification error on the training data. The confusion matrix reveals the model's predictions compared to the actual species classifications. Specifically, it correctly classifies 45 instances of "setosa," 35 instances of "versicolor," and 32 instances of "virginica." However, it does show some misclassifications, with three instances of "versicolor" incorrectly predicted as "virginica" and five instances of "virginica" incorrectly predicted as "versicolor." Refer Fig-8
The overall misclassification error is calculated to be approximately 6.67%, indicating that the model performs well but still has some room for improvement. This analysis demonstrates the effectiveness of using principal components for classification while highlighting areas where the model may need refinement to reduce misclassifications further.
CONFUSION MATRIX AND MISCLASSIFICATION ERROR ON TESTING DATA
p1 <- predict(mymodel, tst)
tab1 <- table(p1, tst$Species)
tab1
# Misclassification error on the testing data
1 - sum(diag(tab1)) / sum(tab1)
In the confusion matrix for the testing data, the model correctly classifies 5 instances of "setosa," 9 instances of "versicolor," and 12 instances of "virginica," but misclassifies 3 instances of "virginica" as "versicolor" and 1 instance of "versicolor" as "virginica." The overall misclassification error is approximately 13.33%, indicating a moderate level of classification accuracy on the testing set.
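Beyond the overall error rate, per-class recall can be read off the same confusion matrix; a small sketch (actual classes are in the columns of tab1, so we divide the diagonal by the column sums):

# Per-class recall: correctly predicted instances / actual instances of each class
diag(tab1) / colSums(tab1)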
ADVANTAGES
- Reduces dimensionality while retaining most of the variance in the data.
- Produces uncorrelated components, removing the multicollinearity present in the original variables.
- Speeds up downstream algorithms and can help filter out noise.
- Makes high-dimensional data easier to visualize, for example through bi-plots.
DISADVANTAGES
- The principal components are linear combinations of the original variables, which makes them harder to interpret.
- PCA captures only linear structure in the data.
- Results depend on the scale of the variables, so standardization is usually required.
- Discarding low-variance components can drop information that is still relevant for prediction.
CONCLUSION
Principal Component Analysis (PCA) is a robust technique for dimensionality reduction that simplifies the analysis of high-dimensional data while preserving essential information. By transforming correlated variables into a set of uncorrelated principal components, PCA not only addresses multicollinearity issues but also facilitates the visualization of complex datasets through biplots. This article has illustrated the application of PCA in understanding and interpreting relationships between variables, particularly in the context of a prediction model. The insights gained through PCA can significantly enhance model performance by reducing noise and improving interpretability.
I hope this article has offered valuable insights into the application of PCA for data analysis. If you have any suggestions, questions, or require further clarification on implementing PCA, please feel free to reach out. I'm here to help and discuss this topic further!