Day 8: MLOps Training
1) Feature Selection in MLOps
We have studied the Filter Method of feature selection, which uses Correlation and Variance Threshold to detect the relevant features. Now comes the EMBEDDED METHOD.
Embedded Method: If we use the coefficient concept for feature selection, then it is part of the Embedded Method. Coefficients are the constants that decide the inclination (slope) of the prediction line in Linear Regression ("c" in y = b + cx). The Embedded Method is needed because Correlation is inappropriate in many cases.
This method provides higher accuracy relative to corr(). However, it is slower than corr() because the coefficients can be obtained only after training a model on the dataset.
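To make the point concrete, here is a minimal sketch (with made-up synthetic data, not from the course) showing that the coefficients only exist after fitting, and that their magnitudes indicate which features matter:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (illustrative only): y depends strongly on the first
# feature, barely on the second, and not at all on the third.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 5.0 * X[:, 0] + 0.01 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)  # coefficients exist only after fitting
print(model.coef_)  # large magnitude -> influential feature
```

A feature whose coefficient is close to zero (here the third one) contributes little to the prediction and is a candidate for removal.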
Rather than training the dataset on a full algorithm just to get the coefficients, we can use a relatively faster embedded method called the Lasso method / L1 Regularization method.
The Lasso method trains the dataset over a feature-coefficient prediction algorithm.
>>> from sklearn.linear_model import Lasso                 # importing the Lasso model
>>> from sklearn.feature_selection import SelectFromModel  # helps to select features from the Lasso model
>>> select = SelectFromModel(Lasso())                      # creating a model for feature selection
>>> select.fit(x_train, y_train)
>>> select.get_support()                                   # gives a boolean array saying which feature is a predictor and which is not
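The steps above can be run end to end. Here is a self-contained sketch (the synthetic data and the alpha value are assumptions for illustration; the snippet above assumes a pre-split x_train/y_train instead):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

# Synthetic data (illustrative only): only the first two of five features
# actually drive the target.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

select = SelectFromModel(Lasso(alpha=0.1))
select.fit(X, y)

mask = select.get_support()       # boolean array, one entry per feature
X_selected = select.transform(X)  # keeps only the selected columns
print(mask, X_selected.shape)
```

The L1 penalty shrinks the coefficients of the irrelevant features to exactly zero, so `get_support()` flags only the genuinely predictive columns, and `transform()` drops the rest.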
2) Feature Engineering
Feature Engineering is a subset of Data Science, just like Machine Learning, but it is not a part of Machine Learning. It is performed before creating ML models, i.e., before applying ML to the data. In other words, Feature Engineering is pre-processing performed on the data, transforming it into a form that makes our ML model more effective and gives us better insights into the data. One Feature Engineering technique is Encoding.
Encoding is the transformation of string values into integers within a particular feature/variable. It is required because if a feature is to be used for prediction in Machine Learning, it should contain integer (numeric) values. This is also known as Variable Encoding/Label Encoding. One such technique is One-Hot Encoding.
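As a quick sketch of plain label encoding (the city names are made-up example values), scikit-learn's LabelEncoder maps each distinct string to an integer code:

```python
from sklearn.preprocessing import LabelEncoder

cities = ["Delhi", "Mumbai", "Delhi", "Chennai"]  # hypothetical string feature
le = LabelEncoder()
codes = le.fit_transform(cities)  # each distinct string gets an integer code

print(list(le.classes_))  # the categories, sorted alphabetically
print(list(codes))        # the integer code assigned to each row
```

Note that LabelEncoder assigns codes in alphabetical order of the categories, which can impose an artificial ordering on the values; this is one reason One-Hot Encoding is often preferred for categorical features.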
One-Hot Encoding: When we have categorical variables in our data (like gender, semester number, etc.), we use One-Hot Encoding. It is the process of converting the categorical variable's values into separate variables. (For example: if the Gender variable has Male and Female values, one-hot encoding turns Male and Female into separate variables and the Gender column is then removed.)
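The Gender example above can be sketched with pandas (the tiny DataFrame is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})

# One column per category replaces the original Gender column
encoded = pd.get_dummies(df["Gender"])
print(encoded)
```

Each row now has a 1/True in exactly one of the `Female`/`Male` columns, which is what makes the two columns perfectly dependent on each other, the issue discussed next.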
One-Hot Encoding presents one issue in its process, called the Dummy Variable Trap.
Dummy Variable Trap: An issue arises in one-hot encoding where, in the case of 2 variables, the x1 variable becomes dependent on the x2 variable and x2 in turn depends on x1. This is known as Multi-Collinearity, where x1 and x2 are duplicate variables. Because of this, during computation our model gets confused or spends high computation power in this cycle of correlation, and different results can be received every time.
To remove the issue of the Dummy Variable Trap, we must remove the multi-collinearity, and for that we need to drop one of the redundant or duplicate variables.
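Dropping one of the redundant columns can be done directly during encoding. A minimal sketch (same made-up Gender data as above) using pandas' `drop_first` option:

```python
import pandas as pd

df = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})

# drop_first=True drops one of the dummy columns, breaking the perfect
# dependence between them (i.e., avoiding the Dummy Variable Trap)
encoded = pd.get_dummies(df["Gender"], drop_first=True)
print(encoded.columns.tolist())  # only one column remains
```

With one column dropped, no information is lost: a row that is not Male is necessarily Female, so a single column is enough and the multi-collinearity disappears.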