Feature Engineering for ML
Introduction:
Feature Engineering is the process of selecting features, extracting features that are embedded in other features, combining features into higher-order features, and finally converting the data into a format a machine can understand. This article covers Feature Engineering on structured/tabular data for classification, regression, and clustering problems. If you are working with unstructured data such as documents, images, or video, the process can be a little different.
The Feature Engineering phase begins after data exploration. During the data exploration phase you become familiar with the dataset, the relationships between the features, and the data quality issues. Feature Engineering is an iterative process: it is essential to cycle through it, test your models, learn from the results, and make incremental changes until you reach an optimal solution.
The goal of Feature Engineering is to remove noise, standardize the data, and choose the right features to predict the target label. Feature Engineering also includes Feature Extraction, which in its simplest form means splitting an unstructured text field into separate columns, extracting key information critical to your model from another feature, or consolidating fields. Below are a few simple examples.
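As a minimal sketch of the three kinds of extraction described above, the pandas snippet below splits a text field into columns, pulls a piece of information out of another feature, and consolidates two fields into one. The data frame, its column names (name, siblings, parents), and the derived columns are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical data: a free-text "name" column and two related count columns.
df = pd.DataFrame({
    "name": ["Smith, Mr. John", "Doe, Mrs. Jane", "Lee, Dr. Amy"],
    "siblings": [1, 0, 2],
    "parents": [0, 2, 1],
})

# 1) Split one text field into separate columns at the first comma.
parts = df["name"].str.split(",", n=1, expand=True)
df["last_name"] = parts[0].str.strip()
rest = parts[1].str.strip()

# 2) Extract key information (the title) embedded in another feature.
df["title"] = rest.str.extract(r"^(\w+)\.", expand=False)

# 3) Consolidate related fields into a single higher-order feature.
df["family_size"] = df["siblings"] + df["parents"] + 1

print(df[["last_name", "title", "family_size"]])
```

Whether a derived column such as family_size actually helps depends on the model and the data, so treat it as a candidate to test rather than a guaranteed improvement.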
Feature Engineering should NOT be focused on identifying patterns in data and providing hints to the models. Below are some examples of things to keep in mind.
Feature Engineering Process:
Train Test Split:
It's time to generate the subsets of data that the models will process. First, separate the features and target into (X, y) data frames, then shuffle the rows and split them into training and test sets.
Now you are ready to train and test your models, iterating over feature engineering and model tuning. In my next article I will share some insights on model selection and tuning.
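One possible way to do this is with scikit-learn's train_test_split, sketched below. The dataset, the target column name (label), the 80/20 split ratio, the stratification, and the random seed are all assumptions made for the example, not recommendations from this article.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset: two features and a binary target column named "label".
df = pd.DataFrame({
    "age": [22, 35, 58, 41, 29, 63, 37, 50, 26, 45],
    "income": [28, 52, 91, 60, 39, 75, 58, 83, 31, 66],
    "label": [0, 0, 1, 1, 0, 1, 0, 1, 0, 1],
})

# Separate the features (X) from the target (y).
X = df.drop(columns=["label"])
y = df["label"]

# Shuffle and split. stratify=y keeps the class balance similar in both subsets;
# random_state makes the shuffle reproducible (both are assumed choices here).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
```

Stratifying only makes sense for classification targets; for regression or clustering problems the stratify argument would be dropped.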
Conclusion:
Feature Engineering should preserve as much information in its raw format as possible while removing noise, outliers, and irrelevant data such as primary keys. As a Feature Engineer, you should NOT, where it can be avoided, introduce bias into the dataset through new derived features or by handling features inconsistently. Have a plan and iterate over it quickly until you find the optimal methods to impute and encode the features.
The goal of this article is to summarize the tasks of feature engineering and help you come up with a plan that can be executed consistently across projects. The information shared in this article is based on my experience, and I would love to hear about others’ experiences and observations.
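To make the plan repeatable, the impute and encode steps can be wrapped in a single preprocessing object so that changing a strategy is a one-line edit. The sketch below uses scikit-learn for this; the column names, the median and most-frequent imputation strategies, and one-hot encoding are assumed starting points to iterate on, not the only valid choices.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data with a primary key and missing values.
df = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": ["Austin", "Boston", np.nan, "Austin"],
})

# Drop the primary key: it identifies rows but carries no predictive signal.
X = df.drop(columns=["customer_id"])

# Numeric columns: impute with the median.
# Categorical columns: impute with the most frequent value, then one-hot encode.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer(
    [
        ("num", SimpleImputer(strategy="median"), ["age"]),
        ("cat", categorical_steps, ["city"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_ready = preprocess.fit_transform(X)
print(X_ready)
```

In practice the transformer would be fitted on the training subset only and then applied to the test subset, so that the test data does not influence the imputed values.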