Feature Engineering for ML
The goal is the bull's-eye, but if you keep staring at the data you get a headache.


Introduction:

Feature Engineering is the process of selecting features, extracting features (embedded in other features), combining features into higher-order features, and finally converting the data into a format a machine can understand. This article covers Feature Engineering for structured/tabular data in classification, regression, and clustering problems. If you are working with unstructured data such as documents, images, or video, the process can be a little different.

The Feature Engineering phase begins after data exploration; during the data exploration phase you become familiar with the dataset, the relationships between features, and data quality issues. Feature Engineering is an iterative process; it is essential to cycle through it, test your models, learn from the results, and make incremental changes until an optimal solution is achieved.

The goal of Feature Engineering is to remove noise, standardize the data, and choose the right features to predict the target label. Feature Engineering also includes Feature Extraction, which at its simplest is splitting an unstructured text field into separate columns, extracting key information critical to your model from another feature, or consolidating fields. Below are a few simple examples.

  • Splitting a Name field into Last, First, and Title, assuming Title or Last Name has an impact on the target prediction. If Name is not important, we can choose to drop it.
  • Extracting Week of the Month, Day of the Year, Day of the Week, etc. from a date field.
  • Ensuring all data conforms to the same units of measure, like USD or EUR, Liters or Gallons, etc.
  • Dropping a unique key if it is just a sequence of numbers with no special meaning. Another example would be dropping Country, State & City when you have the Zip Code, or concatenating them all into a single field.
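The extraction examples above can be sketched with pandas. This is a minimal illustration on a hypothetical dataset; the column names ("name", "order_date", "id") are assumptions, not from the original article.

```python
import pandas as pd

# Hypothetical raw data (column names are illustrative assumptions).
df = pd.DataFrame({
    "id": [1, 2],  # sequential key with no special meaning
    "name": ["Dr. Jane Smith", "Mr. John Doe"],
    "order_date": pd.to_datetime(["2023-01-15", "2023-06-03"]),
})

# Split Name into Title / First / Last.
df[["title", "first", "last"]] = df["name"].str.split(" ", expand=True)

# Extract calendar parts from the date field.
df["day_of_week"] = df["order_date"].dt.dayofweek
df["day_of_year"] = df["order_date"].dt.dayofyear
df["week_of_month"] = (df["order_date"].dt.day - 1) // 7 + 1

# Drop columns that carry no signal for the model.
df = df.drop(columns=["id", "name"])
```

Whether the derived columns actually help is something to verify by iterating on the model, as described above.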

Feature Engineering should NOT be focused on identifying patterns in the data and feeding hints to the models. Below are some examples of things to keep in mind.

  • Avoid calculating the mean, median, or mode for individual features or groups of features; do not introduce such hints, and let the algorithm find the patterns and learn from the data.
  • Not all models require data to be scaled; for example, most ensemble-based classification algorithms can handle unscaled data. First run the model without scaling and compare it with scaled data. If there is not much difference, stay with unscaled data.
  • Be consistent with scaling or normalizing; apply the same logic to all features, or bias will creep in.
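One way to stay consistent when you do scale is to fit the scaler on the training data only and reuse those parameters everywhere else. A minimal sketch with scikit-learn's StandardScaler on synthetic data (the numbers are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=100.0, scale=25.0, size=(150, 2))  # unscaled features
X_test = rng.normal(loc=100.0, scale=25.0, size=(50, 2))

# Learn mean/std from the training data only...
scaler = StandardScaler().fit(X_train)

# ...then apply the SAME parameters to both subsets.
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
```

Fitting a second scaler on the test data would apply different statistics to each subset, which is exactly the inconsistency the bullet above warns against.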


Feature Engineering Process:

  1. Identify the target or label that you plan to predict and determine the type of problem you are solving: classification, regression, time series, etc.
  2. Evaluation Metric: Identify the metric(s) on which the model performance will be evaluated (R², RMSE, precision, recall, F1 score, etc.). You can learn more about metrics here.
  3. Duplicates: Check if there are duplicate records in the dataset and remove them. Consider dropping unique columns (primary keys) that have no impact on the target/label. If you drop records, reset the data frame's index.
  4. Missing Values: Check for missing values in each feature. When most of the data is missing, consider dropping the feature instead of imputing values and introducing bias. You could also build another model to impute the missing values if you choose to go down that path.
  5. Feature Selection: Identify categorical features, i.e., features with text values. Group them into Nominal, Ordinal, and Binary. If you need more details about these feature types, refer to my article here.

  • Nominal features do not have any ranking or hierarchy and hence require special encoding such as One Hot, Hash, Binary, or Frequency encoding. It is critical that you choose the right type of encoding.
  • Ordinal features need to be ranked and assigned a numeric value based on their hierarchy and order of importance. Binary features are simple: convert them to 0 or 1.

Train Test Split:

It's time to generate subsets of the data that can be processed by the models. First shuffle the data, then separate the features and target into (X, y) data frames.

  1. Shuffle the dataset, except for time series-based problems. Shuffling ensures the model does not learn anything from the ordering. Use a random seed to ensure repeatability.
  2. Separate the data vertically into the independent variables (X) and the dependent variable, also called the target or label (y). You are trying to predict y based on X.
  3. Split the data horizontally, using a train/test split with, say, a 70/30 ratio, into sets for training and testing your model. Again, a random seed ensures repeatability.
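The three steps above map directly onto scikit-learn's `train_test_split`, which handles the shuffling and the seed in one call. A minimal sketch on a synthetic frame (column names are assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "f1": range(10),
    "f2": range(10, 20),
    "target": [0, 1] * 5,
})

# Vertical split: independent variables (X) vs. the target/label (y).
X = df.drop(columns=["target"])
y = df["target"]

# Horizontal split: 70/30 train/test, shuffled, with a fixed seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=42)
```

For time series problems you would pass `shuffle=False` and split on time instead, per step 1.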

Now you are ready to train and test your models and iterate over feature engineering and model tuning. In my next article I will share some insights on model selection and tuning.

Conclusion:

Feature Engineering should preserve as much information in its raw form as possible while removing noise, outliers, and irrelevant data like primary keys. As a feature engineer you should NOT introduce bias into the dataset through new derived features or by handling features inconsistently. Have a plan and iterate over it quickly until you find the optimal methods to impute and encode the features.

The goal of this article is to summarize the tasks of feature engineering and help you come up with a plan that can be executed consistently across projects. The information shared here is based on my experience, and I would love to hear others' experiences and observations.

By Anand Peri