Chi Square Feature Selection in Python
Chi-Square Test:
The Chi-Square test of independence is a statistical test to determine whether there is a significant relationship between two categorical variables. In simple words, the Chi-Square statistic tests whether there is a significant difference between the observed and the expected frequencies of both variables.
The Chi-Square statistic is calculated as follows:
χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ
where Oᵢ is the observed frequency in each cell of the contingency table and Eᵢ is the frequency expected in that cell if the two variables were independent.
The Null hypothesis is that there is NO association between both variables.
The Alternate hypothesis says there is evidence to suggest there is an association between the two variables.
In our case, we will use the Chi-Square test to find which variables have an association with the Survived variable. If we reject the null hypothesis for a feature, that feature is a good candidate to use in your model.
To reject the null hypothesis, the calculated p-value needs to be below a defined threshold (alpha). For example, with an alpha of 0.05, we reject the null hypothesis whenever the p-value < 0.05, and in that case you should consider using the variable in your model.
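As a quick sanity check of this decision rule, the statistic, p-value, and degrees of freedom can be computed in one call with scipy. The 2x2 contingency table below is made up for illustration:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical observed frequencies for two categorical variables
# (rows: Survived = 0/1, columns: Sex = female/male; made-up counts)
observed = np.array([[81, 468],
                     [233, 109]])

# One call returns the statistic, p-value, degrees of freedom,
# and the expected frequencies under independence
chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2={chi2:.2f}, p={p:.4g}, dof={dof}")

alpha = 0.05
if p < alpha:
    print("Reject the null hypothesis: the variables are associated")
else:
    print("Fail to reject the null hypothesis")
```

With counts this lopsided the p-value comes out far below 0.05, so the null hypothesis of independence is rejected.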
Rules to use the Chi-Square Test:
1. Variables are Categorical
2. Frequency is at least 5
3. Variables are sampled independently
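Rule 2 refers to the expected frequencies: a common guideline is that every cell of the expected table should be at least 5. Since chi2_contingency also returns the expected table, the rule is easy to check. A minimal sketch with a made-up table:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Made-up observed contingency table
observed = np.array([[12, 7],
                     [9, 14]])

# The fourth return value is the table of expected frequencies
# under the null hypothesis of independence
_, _, _, expected = chi2_contingency(observed)

rule_ok = (expected >= 5).all()
print("All expected frequencies >= 5:", rule_ok)
```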
Chi-Square Test in Python
We will now implement this test in an easy-to-use Python class that we will call ChiSquare. The class is initialized with a pandas DataFrame containing the dataset to be used for testing. The Chi-Square test produces the quantities mentioned previously: the p-value, the Chi-Square statistic, and the degrees of freedom. We won't have to implement the formulas ourselves, as we will use the scipy implementation for this.
Let’s begin creating our class.
import pandas as pd
import numpy as np
import scipy.stats as stats
from scipy.stats import chi2_contingency
class ChiSquare:
    def __init__(self, dataframe):
        self.df = dataframe
        self.p = None  # P-value
        self.chi2 = None  # Chi-Square test statistic
        self.dof = None  # Degrees of freedom
        self.dfObserved = None  # Observed frequencies
        self.dfExpected = None  # Expected frequencies
Next, we define a function called _print_chisquare_result, which accepts the name of column X and the alpha value as inputs. Alpha is the threshold used to decide whether to reject the null hypothesis of the Chi-Square test of independence. This function prints whether variable X is important or not; in the code, it compares the p-value (which we will compute next) against this threshold.
    def _print_chisquare_result(self, colX, alpha):
        result = ""
        if self.p < alpha:
            result = "{0} is IMPORTANT for Prediction".format(colX)
        else:
            result = "{0} is NOT an important predictor. (Discard {0} from model)".format(colX)
        print(result)
Now we implement the actual logic for performing the Chi-Square test using scipy, in a new function called TestIndependence. This function accepts two column names, colX and colY, which are the two variables being compared. When using this class, colY is your objective, the variable you are trying to predict (Survived in our example), and colX is the feature you are testing against it. The last parameter is alpha, which defaults to 0.05.
First, we convert colX and colY to string type. Remember, the Chi-Square test requires categorical variables.
To calculate the frequency counts we use the pandas crosstab function. The observed and expected frequencies are stored in the dfObserved and dfExpected data frames as they are calculated.
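For illustration, this is how pd.crosstab builds the observed frequency table. The tiny DataFrame below is made up just to show the shape of the result:

```python
import pandas as pd

# Hypothetical toy data with a target and one categorical feature
df = pd.DataFrame({
    'Survived': ['1', '0', '1', '0', '1', '0'],
    'Sex':      ['female', 'male', 'female', 'male', 'male', 'female'],
})

# Observed frequencies: rows = target (colY), columns = feature (colX)
observed = pd.crosstab(df['Survived'], df['Sex'])
print(observed)
```

Each cell counts how often a (target, feature) combination occurs; this table is exactly what gets passed to chi2_contingency.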
Finally, we use the scipy function chi2_contingency to calculate the Chi-Square statistic, the p-value, the degrees of freedom, and the expected frequencies. One line for everything described in the Chi-Square test section! We then simply store the results in our class variables.
The last step is we call the _print_chisquare_result that performs the logic previously defined and tells the result of the test for our feature selection.
    def TestIndependence(self, colX, colY, alpha=0.05):
        X = self.df[colX].astype(str)
        Y = self.df[colY].astype(str)
        self.dfObserved = pd.crosstab(Y, X)
        chi2, p, dof, expected = stats.chi2_contingency(self.dfObserved.values)
        self.p = p
        self.chi2 = chi2
        self.dof = dof
        self.dfExpected = pd.DataFrame(expected, columns=self.dfObserved.columns, index=self.dfObserved.index)
        self._print_chisquare_result(colX, alpha)
Chi-Square Feature Selection in Python
We are now ready to use the Chi-Square test for feature selection using our ChiSquare class. Let’s now import the dataset. The second line below adds a dummy variable using numpy that we will use for testing if our ChiSquare class can determine this variable is not important. This dummy variable has equal chances of being a 1 or 0 in each row.
df = pd.read_csv('data/Update (1).csv')
df['dummyCat'] = np.random.choice([0, 1], size=(len(df),), p=[0.5, 0.5])
df.head()
Let's now initialize our ChiSquare class and loop through multiple columns, running the Chi-Square test for each of them against our target variable (the Unit column in this dataset). The class prints whether each feature is important for your machine learning model. You'll notice that our dummyCat variable is among the unimportant ones.
#Initialize ChiSquare Class
cT = ChiSquare(df)
#Feature Selection
testColumns = ['patient name','Gender','Age','Contact number','Address','dummyCat']
for var in testColumns:
cT.TestIndependence(colX=var,colY="Unit" )
Another Way to Do Chi-Square Feature Selection:
# Load libraries
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# Load iris data
iris = load_iris()
# Create features and target
X = iris.data
y = iris.target
# Convert to categorical data by converting data to integers
X = X.astype(int)
#Compare Chi-Squared Statistics
# Select two features with highest chi-squared statistics
chi2_selector = SelectKBest(chi2, k=2)
X_kbest = chi2_selector.fit_transform(X, y)
# Show results
print('Original number of features:', X.shape[1])
print('Reduced number of features:', X_kbest.shape[1])
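To see which features were kept rather than just how many, SelectKBest exposes a get_support method that returns a boolean mask over the original columns. A short sketch continuing the iris example:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Same setup as above: iris data converted to integers
iris = load_iris()
X = iris.data.astype(int)
y = iris.target

# Fit the selector that keeps the two highest-scoring features
chi2_selector = SelectKBest(chi2, k=2)
chi2_selector.fit(X, y)

# Boolean mask over the original feature columns
mask = chi2_selector.get_support()
selected = [name for name, keep in zip(iris.feature_names, mask) if keep]
print('Selected features:', selected)
```

The per-feature chi-squared scores themselves are also available afterwards as chi2_selector.scores_, which is useful for ranking all features rather than keeping a fixed k.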
For more information, contact me. Visit my Facebook page and group.
Page Link: https://www.facebook.com/WebProgrammingAndDataMining
Group Link: https://www.facebook.com/groups/1072715656212787/
About Me: http://biplob.wordpressctg.com/