To Be or Not To Be a Feature
In many prediction problems, you may be overwhelmed by a large number of input or feature variables, many of which are irrelevant for the prediction. There are many techniques for selecting the relevant subset of features. This kind of preprocessing work, done before building a machine learning model, is known as feature engineering.
Proper feature selection is a critical step towards building a good learning model. In this post, my focus will be on feature selection by assigning each feature variable a score based on various statistical measures.
Feature Reduction
In feature reduction, we extract m dimensions out of the original n, where m is less than n. This is generally accomplished with a technique called Principal Component Analysis (PCA).
In PCA, a new set of variables is derived as a function of the original feature variables. The new variables are uncorrelated with each other.
For prediction problems, PCA is not very effective, because it does not take the output or class variable into account.
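To make this concrete, here is a minimal sketch of PCA-based reduction using scikit-learn; the data and the choice of m = 2 components are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 100 samples with n = 5 feature variables
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))

# Extract m = 2 dimensions out of the original n = 5
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Each new variable is a linear function of the original features,
# and the new variables are uncorrelated with each other
print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # variance captured by each component
```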
Feature Selection
In feature selection, we select a subset of the original feature set. One brute-force way to select a feature subset is to try all possible subsets with the learning algorithm and choose the one that yields the minimum error. Intuitively, a good feature variable should have the following characteristics.
A good feature variable will be highly correlated with the output variable and largely uncorrelated with the other feature variables.
The procedure for feature subset selection is as follows. All features are assigned scores using statistical measures based on entropy and mutual information. The features are then ranked by score, and the top k features are selected.
Entropy is a measure of the randomness of a variable. Mutual information is a measure of the mutual dependence between two variables.
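As a minimal sketch, assuming small hypothetical categorical data, the two measures can be computed for discrete variables as follows; the avenir implementation computes them at scale on Hadoop rather than in plain Python.

```python
import numpy as np
from collections import Counter

def entropy(x):
    # H(X) = -sum over x of p(x) * log2 p(x)
    counts = np.array(list(Counter(x).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def mutual_information(x, y):
    # I(X;Y) = H(X) + H(Y) - H(X,Y)
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

# Hypothetical categorical feature and class labels
feature = ['a', 'a', 'b', 'b', 'a', 'b']
label   = [ 0,   0,   1,   1,   0,   1 ]
print(mutual_information(feature, label))  # 1.0 bit: the feature fully determines the label
```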
In my OSS project avenir, I have a Hadoop-based implementation of five statistical measures based on entropy and mutual information. They are as follows:
- Mutual Information Maximization (MIM)
- Mutual Information Feature Selection (MIFS)
- Joint Mutual Information (JMI)
- Double Input Symmetrical Relevance (DISR)
- Minimum Redundancy Maximum Relevance (MRMR)
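To give a flavor of how such a measure works, here is a single-machine Python sketch of greedy MRMR selection, not the avenir Hadoop implementation; it uses scikit-learn's mutual_info_score for discrete variables, and the small feature matrix is hypothetical.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mrmr(X, y, k):
    # Greedy MRMR: at each step pick the feature with the highest
    # relevance I(Xi;Y) minus its mean redundancy I(Xi;Xj) over the
    # features already selected
    n = X.shape[1]
    relevance = [mutual_info_score(X[:, i], y) for i in range(n)]
    selected, remaining = [], list(range(n))
    for _ in range(k):
        def score(i):
            if not selected:
                return relevance[i]
            redundancy = np.mean([mutual_info_score(X[:, i], X[:, j]) for j in selected])
            return relevance[i] - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical discrete feature matrix (rows = samples) and class labels;
# feature 1 mirrors feature 0 and carries no extra information
X = np.array([
    [0, 1, 0],
    [0, 1, 0],
    [1, 0, 0],
    [1, 0, 1],
    [1, 0, 1],
    [1, 0, 1],
    [1, 0, 0],
    [1, 0, 0],
])
y = np.array([0, 0, 0, 1, 1, 1, 1, 1])
print(mrmr(X, y, 2))  # [0, 2]: the redundant feature 1 is skipped
```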
Details of these techniques can be found in my earlier post, where I used hospital readmission as a classification use case with 10 feature variables.
It's not easy to decide which of these techniques will work best. One option is to select the top k features with each technique and run the learning algorithm on each resulting subset. The technique whose subset yields the minimum error is the one to select.
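A minimal sketch of this wrapper-style comparison, assuming a hypothetical dataset and hypothetical top-k index lists from two of the techniques:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical data: 200 samples, 10 feature variables
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 3] > 0).astype(int)  # labels depend on features 0 and 3

# Hypothetical top-k subsets produced by two of the selection techniques
subsets = {
    'MIM':  [0, 3, 5],
    'MRMR': [0, 2, 7],
}

# Run the learning algorithm on each subset; the technique whose
# subset gives the best cross-validated score is the one to keep
for name, cols in subsets.items():
    score = cross_val_score(LogisticRegression(), X[:, cols], y, cv=5).mean()
    print(name, round(score, 3))
```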
Some Examples
Here are some examples, along with heuristics to apply, for deciding whether a feature variable should be retained for building the prediction model (a short sketch applying these checks follows the list).
- Has very little variance: discard it; its correlation with the output variable will be weak.
- Has strong correlation with the output variable and weak correlation with other feature variables: retain it.
- Two variables are strongly correlated with each other and strongly correlated with the output variable: retain one and discard the other.
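A quick sketch of these variance and correlation checks, assuming hypothetical numeric features and output:

```python
import numpy as np

# Hypothetical features: x1 nearly duplicates x0, x2 is almost constant
rng = np.random.default_rng(1)
x0 = rng.normal(size=200)
x1 = x0 + rng.normal(scale=0.1, size=200)
x2 = 3.0 + rng.normal(scale=0.01, size=200)
y = x0 + rng.normal(scale=0.5, size=200)

X = np.column_stack([x0, x1, x2])
print(X.var(axis=0))  # x2 has very little variance: discard it

# Correlation matrix over the features and the output (last row/column)
corr = np.corrcoef(np.column_stack([X, y]), rowvar=False)
print(corr.round(2))
# x0 and x1 are strongly correlated with each other and with y:
# retain one of them and discard the other
```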
Finally
Even if you are not doing feature analysis for building a learning model, using these techniques will give you valuable insights into the feature variables of a problem.