What is Dimensionality Reduction in ML and how do we achieve it…

As beginners in ML, we struggle to understand the concept and use of algorithms for Supervised Learning, Unsupervised Learning, Reinforcement Learning, Deep Learning, etc. It becomes even more confusing when we come across algorithms around Dimensionality Reduction, Principal Component Analysis, Stochastic Gradient Descent, Ensemble Learning, etc. Let us be clear that all of these algorithms deal with the features of the data during data analysis and data wrangling. In Machine Learning and Deep Learning there are multiple algorithms available for feature selection and feature extraction.

Let’s take a simple example of creating a model to classify whether a patient has heart disease or not, based on the diagnosis dataset a hospital has. For this problem we can use any supervised classification algorithm to create the classification model, train/test the model on the available data, refine the model’s predictions with an ensemble technique/algorithm and then start using the model for prediction. But the core work in Machine Learning is not just to pick a model based on the requirement. The tangible work lies in analysing and understanding the data, cleaning the data, giving emphasis to feature selection as well as feature extraction, working on feature scaling, etc., so that the data the model will use to train itself for future prediction is as close to perfect as possible. (In ML, features means the independent variables that will be used for prediction.)

So, with this understanding, let us see what Dimensionality Reduction is and when we use these algorithms in ML. We will also glance through the different types of dimensionality reduction techniques and algorithms available and how to use them.

What is Dimensionality Reduction?

As per Wikipedia, "In statistics, machine learning, and information theory, dimensionality reduction or dimension reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. Approaches can be divided into feature selection and feature extraction."

What does this mean?

In machine learning classification problems, there are often too many factors on the basis of which the final classification is done. These factors are basically variables called features. The higher the number of features, the harder it gets to visualize the training set and then work on it. Sometimes, most of these features are correlated, and hence redundant. This is where dimensionality reduction algorithms come into play. Dimensionality reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. It can be divided into feature selection and feature extraction. An intuitive instance of dimensionality reduction can be understood through a simple e-mail classification problem, where we need to classify whether an e-mail is spam or ham (not spam). This can involve a large number of features, such as whether or not the e-mail has a generic title, what type of content the e-mail has, whether the e-mail uses a template, etc. However, some of these features may overlap. Similarly, a classification problem that relies on both humidity and rainfall can collapse those two variables into just one underlying feature, since the two are correlated to a high degree. Hence, we can reduce the number of features in such problems.

When a dataset has more than three dimensions, it becomes impossible to see what is going on with the naked eye. A 3-D classification problem can be hard to visualize, whereas a 2-D one can be mapped to a simple two-dimensional space, and a 1-D problem to a simple line. The main objective of dimensionality reduction is to find a low-dimensional representation of the data that retains as much information as possible.

Let me explain in a different way for simplicity...

In machine learning, we often have too many factors on which the final classification is done. These factors are basically known as variables or features. The higher the number of features, the harder it gets to visualize the training set and then work on it. Sometimes, most of these features are correlated, and hence redundant. This is where dimensionality reduction algorithms come into play.

Hope the concept and purpose of dimensionality reduction is clear now :)

There are two ways we accomplish dimensionality reduction:

1.      Feature selection: In this, we try to find a subset of the original set of variables, or features, to get a smaller subset which can be used to model the problem. It usually involves one of three approaches:

a.      Filter

b.     Wrapper

c.      Embedded

2.      Feature extraction: This reduces the data in a high-dimensional space to a lower-dimensional space, i.e. a space with a smaller number of dimensions.
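To make the distinction concrete, here is a minimal sketch applying a feature-selection method and a feature-extraction method to the same data; the synthetic dataset, column counts and the choice of five features are illustrative assumptions, not part of any fixed recipe.

```python
# Contrast feature selection (keep original columns) with
# feature extraction (build new columns). Sizes are illustrative.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=42)

# Feature selection: keep 5 of the original 20 columns (they stay interpretable).
X_selected = SelectKBest(score_func=f_classif, k=5).fit_transform(X, y)
print(X_selected.shape)   # (500, 5)

# Feature extraction: build 5 new components, each a mix of all 20 columns.
X_extracted = PCA(n_components=5).fit_transform(X)
print(X_extracted.shape)  # (500, 5)
```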

In ML, people use various methods for dimensionality reduction. I will try to list most of them and explain their use, but in this article I will not go into details for any of them, just to make sure the article is not too long to read. In a subsequent article I will try to do a deep dive into a few of the most popular dimensionality reduction techniques like Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA) and Generalized Discriminant Analysis (GDA).

Before I discuss the ways we can perform dimensionality reduction, let me touch upon the “Curse of Dimensionality” concept…

[Figure: model performance versus number of features; performance peaks at an optimal number of features and degrades beyond it]

You can see in the above diagram that, as the number of features in the dataset increases, the performance of the model decreases after a certain optimal number of features. It is simple to understand: as the number of features increases, the model becomes more complex, and the more features there are, the higher the chance of overfitting. A machine learning model trained on a large number of features becomes increasingly dependent on the data it was trained on and, in turn, overfitted, resulting in poor performance on real data and defeating the purpose.

Avoiding overfitting is a major motivation for performing dimensionality reduction. The fewer features our training data has, the fewer assumptions our model makes and the simpler it will be. But that is not all; dimensionality reduction has a lot more advantages to offer, like:

·      It helps in data compression, and hence reduces storage space.

·      It reduces computation time.

·      It also helps remove redundant features, if any.

·      It improves model accuracy by removing misleading data.

·      With fewer dimensions there is less data to process, so algorithms get trained faster and need less computing power.

Let me discuss a few common ways to perform dimensionality reduction on a dataset:

1. Missing Values: While exploring data, if we encounter missing values, what do we do? Our first step should be to identify the reason, then impute the missing values or drop the variables using appropriate methods. But what if we have too many missing values? Should we impute them or drop the variable? I would prefer the latter, because a variable that is mostly missing does not carry much information about the dataset and would not help in improving the power of the model. Next question: is there any threshold of missing values for dropping a variable? It varies from case to case; as a rule of thumb, if the information contained in the variable is not significant, you can drop it when it has more than ~40-50% missing values.
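As a rough illustration, here is a minimal pandas sketch; the toy columns and the 40% threshold are assumptions for demonstration only.

```python
import numpy as np
import pandas as pd

# Illustrative toy data: 'cholesterol' is mostly missing.
df = pd.DataFrame({
    "age": [63, 45, 57, 70, 52],
    "cholesterol": [233, np.nan, np.nan, np.nan, np.nan],
    "max_heart_rate": [150, 172, np.nan, 140, 165],
})

# Share of missing values per column.
missing_ratio = df.isna().mean()

# Drop columns above a chosen threshold (here 40%, a rule of thumb, not a fixed rule).
df_reduced = df.drop(columns=missing_ratio[missing_ratio > 0.40].index)
print(df_reduced.columns.tolist())  # ['age', 'max_heart_rate']
```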

2. Low Variance Filter: Let’s think of a scenario where we have a constant variable (all observations have the same value, say 5) in our dataset. Do you think it can improve the power of the model? Of course not, because it has zero variance. When the number of dimensions is high, we should drop variables having low variance compared to the others, because such variables will not explain the variation in the target variable.
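A minimal sketch using scikit-learn's VarianceThreshold on a toy matrix; the values and the zero threshold are illustrative, and for non-zero thresholds the features should usually be scaled first so their variances are comparable.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy matrix: the second column is constant (variance 0) and carries no information.
X = np.array([[1.0, 5.0, 10.2],
              [2.5, 5.0,  9.8],
              [0.7, 5.0, 11.1],
              [3.1, 5.0, 10.5]])

# Remove features whose variance does not exceed the threshold
# (threshold=0.0 drops only constant columns).
selector = VarianceThreshold(threshold=0.0)
X_reduced = selector.fit_transform(X)
print(X_reduced.shape)  # (4, 2): the constant column is gone
```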

3. Decision Trees: This is one of my favourite techniques. It can be used as a versatile tool to tackle multiple challenges like missing values, outliers and identifying significant variables. Many data scientists use a decision tree’s feature importance scores to shortlist variables, and it works well for them.
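As a sketch, a fitted decision tree exposes impurity-based importance scores that can be used to rank variables; the synthetic data and hyperparameters below are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a diagnosis dataset (sizes are assumptions).
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=0)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# Impurity-based importances: higher values mark the more significant variables.
ranked = sorted(enumerate(tree.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for idx, score in ranked[:5]:
    print(f"feature_{idx}: {score:.3f}")
```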

4. Random Forest: Similar to the decision tree is the Random Forest. Here multiple decision trees are built and the most-voted prediction is taken. Just be careful that random forests have a tendency to be biased towards variables with a larger number of distinct values, i.e. they favour numeric variables over binary/categorical variables.
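A minimal sketch; here I use permutation importance on held-out data, which is one common way to soften the cardinality bias mentioned above (the dataset and parameters are illustrative assumptions).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=12, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Permutation importance on held-out data is less biased toward high-cardinality
# features than the built-in impurity-based importances.
result = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)
top_features = result.importances_mean.argsort()[::-1][:5]
print(top_features)  # indices of the five most influential features
```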

5. High Correlation Filter: Dimensions exhibiting high correlation with each other can lower the performance of the model. Moreover, it is not good to have multiple variables carrying similar information or variation, a situation also known as “multicollinearity”. You can use a correlation matrix of continuous or discrete variables to identify the variables with high correlation, and the Variance Inflation Factor (VIF) to decide which one to keep. Variables having a higher value (VIF > 5) can be dropped.
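A minimal sketch of computing VIF with statsmodels on toy correlated columns; the data and the VIF > 5 cut-off are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Toy data where 'rainfall' is deliberately built from 'humidity' (an assumption).
rng = np.random.default_rng(0)
humidity = rng.normal(60, 10, 200)
df = pd.DataFrame({
    "humidity": humidity,
    "rainfall": humidity * 0.8 + rng.normal(0, 2, 200),
    "temperature": rng.normal(25, 5, 200),
})

# Add an intercept column, then compute VIF for each original variable.
exog = add_constant(df)
vif = pd.Series(
    [variance_inflation_factor(exog.values, i) for i in range(1, exog.shape[1])],
    index=df.columns,
)
print(vif)  # humidity and rainfall should show VIF well above 5
```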

6. Backward Feature Elimination: In this method, we start with all n dimensions and compute the sum of squared residuals (SSR) after eliminating each variable in turn (n times). We then identify the variable whose removal produces the smallest increase in the SSR and remove it, leaving us with n-1 input features. We repeat this process until no further variables can be dropped.
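scikit-learn's SequentialFeatureSelector with direction="backward" follows the same idea, although it scores candidates by cross-validated model performance rather than the raw SSR; a minimal sketch on synthetic data, with the target of four features chosen arbitrarily.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=300, n_features=10, n_informative=4, noise=5.0,
                       random_state=0)

# Backward elimination: start from all 10 features and drop the one whose removal
# hurts the cross-validated score the least, repeating until 4 remain.
backward = SequentialFeatureSelector(LinearRegression(), n_features_to_select=4,
                                     direction="backward").fit(X, y)
print(backward.get_support())  # boolean mask of the retained features
```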

7. Forward Feature Selection: This is the reverse of Backward Feature Elimination. In this method, we start with a single variable and analyse the performance of the model as we add one more variable at a time. At each step, the variable that produces the largest improvement in model performance is selected.
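The same class run with direction="forward" gives a minimal sketch of forward selection (again on synthetic data, with the target feature count an arbitrary assumption).

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=300, n_features=10, n_informative=4, noise=5.0,
                       random_state=0)

# Forward selection: start from an empty set and repeatedly add the feature
# that improves the cross-validated score the most, until 4 are selected.
forward = SequentialFeatureSelector(LinearRegression(), n_features_to_select=4,
                                    direction="forward").fit(X, y)
print(forward.get_support())  # boolean mask of the selected features
```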

8. Factor Analysis: When some variables are highly correlated, they can be grouped by their correlations, i.e. all variables in a particular group can be highly correlated among themselves but have low correlation with the variables of other group(s). Here each group represents a single underlying construct or factor. These factors are few in number compared to the large number of original dimensions, but they are difficult to observe directly. There are basically two methods of performing factor analysis (a short sketch follows the list below):

·      EFA (Exploratory Factor Analysis)

·      CFA (Confirmatory Factor Analysis)
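As a rough sketch, scikit-learn's FactorAnalysis (closest in spirit to EFA; CFA usually needs a dedicated package) can recover a small number of hidden factors from correlated variables; the data below is synthetic and the factor count is an assumption.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Synthetic data: 6 observed variables driven by 2 hidden factors plus noise.
rng = np.random.default_rng(0)
factors = rng.normal(size=(300, 2))
loadings = rng.normal(size=(2, 6))
X = factors @ loadings + rng.normal(scale=0.3, size=(300, 6))

fa = FactorAnalysis(n_components=2, random_state=0).fit(X)
print(fa.components_.shape)   # (2, 6): estimated loadings of each factor
X_factors = fa.transform(X)   # (300, 2): the data described by just 2 factors
print(X_factors.shape)
```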

9. Principal Component Analysis (PCA): PCA is an unsupervised machine learning technique that creates a low-dimensional representation of a dataset. PCA mathematically transforms the variables in the dataset to form principal components that capture the maximal variance in the data. These principal components are linear combinations of the original variables and are mutually uncorrelated. So, if you have a dataset with hundreds of variables, many of them correlated, PCA can turn it into a dataset with far fewer variables that capture the highest variance and have no multicollinearity between them.
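A minimal sketch with scikit-learn, standardising the features first and keeping enough components to explain roughly 95% of the variance; the dataset and the 95% figure are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# 30 partly correlated clinical features; scale first so no single feature dominates.
X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Keep enough components to explain ~95% of the variance.
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)

print(X_pca.shape)                           # far fewer columns than the original 30
print(pca.explained_variance_ratio_.sum())   # >= 0.95
```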

Apart from these, few more techniques are also used for dimensionality reduction, such as:

10. Linear Discriminant Analysis (LDA): LDA is a very common technique used for supervised classification problems.

11. Generalized Discriminant Analysis (GDA): GDA deals with nonlinear discriminant analysis using a kernel function operator.

12. Independent Component Analysis (ICA): We can use ICA to transform the data into independent components which describe the data using a smaller number of components.

13. ISOMAP: We use this technique when the data is strongly non-linear.

14. t-Distributed Stochastic Neighbour Embedding (t-SNE): This technique also works well when the data is strongly non-linear, and it works extremely well for visualizations.

15. UMAP: This technique works well for high-dimensional data, and its run time is shorter than that of t-SNE.
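As a small sketch of the visualization use case, t-SNE can squash a high-dimensional dataset down to two dimensions for plotting; the dataset and the perplexity value are illustrative, and UMAP offers a similar embedding through the third-party umap-learn package.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# 64-dimensional digit images reduced to 2-D purely for visualization.
X, y = load_digits(return_X_y=True)
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(X_2d.shape)  # (1797, 2): ready to scatter-plot, coloured by the digit label y
```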

Before I conclude this article, let’s also understand some disadvantages of Dimensionality Reduction…

It is obvious that Dimensionality Reduction may lead to some amount of data loss. We also may not know how many principal components to keep; in practice, some rules of thumb are applied based on general assumptions, such as keeping enough components to explain a chosen share of the variance or inspecting a scree plot.

In this article, we looked at a simplified view of Dimensionality Reduction, covering its importance, its benefits, the commonly used methods and when to prefer a particular technique. In a future post, I will write about PCA in more detail.
