Feature Engineering for ML
The goal is the bull's-eye, but if you keep staring at the data you get a headache.


Introduction:

Feature Engineering is the process of selecting features, extracting features (embedded in other features), combining features into higher-order features, and finally converting the data into a format a machine can understand. This article covers Feature Engineering for structured/tabular data in classification, regression, and clustering problems. If you are working with unstructured data such as documents, images, or video, the process can be a little different.

The Feature Engineering phase begins after data exploration; during the data exploration phase you become familiar with the dataset, the relationships between features, and data quality issues. Feature Engineering is an iterative process; it is essential to cycle through it, test your models, learn from the results, and make incremental changes until an optimal solution is achieved.

The goal of Feature Engineering is to remove noise, standardize the data, and choose the right features to predict the target label. Feature Engineering also includes Feature Extraction, which at its simplest is splitting an unstructured text field into separate columns, extracting key information critical to your model from another feature, or consolidating fields. Below are a few simple examples.

  • Splitting a Name field into Last, First, and Title, assuming Title or Last Name has an impact on the target prediction. If Name is not important, we can choose to drop it.
  • Extracting Week of the Month, Day of the Year, Day of the Week, etc. from a date field.
  • Ensuring all data conforms to the same units of measure, like USD or EUR, Liters or Gallons, etc.
  • Dropping a unique key if it is just a sequence of numbers with no special meaning. Another example would be dropping Country, State & City when you have the Zip Code, or concatenating them all into a single field.
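The extraction examples above can be sketched with pandas. This is a minimal illustration on a hypothetical dataset; the column names ("name", "order_date", "id") are assumptions, not from the original article.

```python
import pandas as pd

# Hypothetical raw data (column names are illustrative assumptions).
df = pd.DataFrame({
    "id": [1, 2],  # sequential key with no special meaning
    "name": ["Dr. Jane Smith", "Mr. John Doe"],
    "order_date": pd.to_datetime(["2023-01-15", "2023-06-03"]),
})

# Split Name into Title / First / Last.
df[["title", "first", "last"]] = df["name"].str.split(" ", expand=True)

# Extract calendar parts from the date field.
df["day_of_week"] = df["order_date"].dt.dayofweek
df["day_of_year"] = df["order_date"].dt.dayofyear
df["week_of_month"] = (df["order_date"].dt.day - 1) // 7 + 1

# Drop columns that carry no signal for the model.
df = df.drop(columns=["id", "name"])
```

Whether the derived columns actually help is something to verify by iterating on the model, as described above.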

Feature Engineering should NOT be focused on identifying patterns in the data and feeding hints to the models. Below are some examples of things to keep in mind.

  • Avoid calculating the mean, median, or mode for individual features or groups of features; do not introduce such hints, and let the algorithm find the patterns and learn from the data.
  • Not all models require data to be scaled; for example, most ensemble-based classification algorithms can handle unscaled data. First run the model without scaling and compare it with scaled data. If there is not much difference, stay with unscaled data.
  • Be consistent with scaling or normalizing; apply the same logic to all features, or bias will creep in.
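One way to stay consistent when you do scale is to fit the scaler on the training data only and reuse those parameters everywhere else. A minimal sketch with scikit-learn's StandardScaler on synthetic data (the numbers are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=100.0, scale=25.0, size=(150, 2))  # unscaled features
X_test = rng.normal(loc=100.0, scale=25.0, size=(50, 2))

# Learn mean/std from the training data only...
scaler = StandardScaler().fit(X_train)

# ...then apply the SAME parameters to both subsets.
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
```

Fitting a second scaler on the test data would apply different statistics to each subset, which is exactly the inconsistency the bullet above warns against.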


Feature Engineering Process:

  1. Identify the target or label that you plan to predict and determine the type of problem you are solving: classification, regression, time series, etc.
  2. Evaluation Metric: Identify the metric(s) on which the model performance will be evaluated (R², RMSE, precision, recall, F1 score, etc.). You can learn more about metrics here.
  3. Duplicates: Check if there are duplicate records in the dataset and remove them. Consider dropping unique columns (primary keys) that have no impact on the target/label. If you drop records, reset the data frame's index.
  4. Missing Values: Check for missing values in each feature. When most of the data is missing, consider dropping the feature instead of imputing values and introducing bias. You could also build another model to impute the missing values if you choose to go down that path.
  5. Feature Selection: Identify categorical features, i.e., features with text values. Group them into Nominal, Ordinal, and Binary. If you need more details about these feature types, refer to my article here.

  • Nominal features do not have any ranking or hierarchy and hence require special encoding such as One Hot, Hash, Binary, or Frequency encoding. It is critical that you choose the right type of encoding.
  • Ordinal features need to be ranked and assigned a numeric value based on their hierarchy and order of importance. Binary features are simple: convert them to 0 or 1.

Train Test Split:

It's time to generate subsets of the data that can be processed by the models. First shuffle the data, then separate the features and target into (X, y) data frames.

  1. Shuffle the dataset, except for time series-based problems. Shuffling ensures the model does not learn anything from the ordering. Use a random seed to ensure repeatability.
  2. Separate the data vertically into the independent variables (X) and the dependent variable, also called the target or label (y). You are trying to predict y based on X.
  3. Split the data horizontally, using a train/test split with, say, a 70/30 ratio, into sets for training and testing your model. Again, a random seed ensures repeatability.
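The three steps above map directly onto scikit-learn's `train_test_split`, which handles the shuffling and the seed in one call. A minimal sketch on a synthetic frame (column names are assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "f1": range(10),
    "f2": range(10, 20),
    "target": [0, 1] * 5,
})

# Vertical split: independent variables (X) vs. the target/label (y).
X = df.drop(columns=["target"])
y = df["target"]

# Horizontal split: 70/30 train/test, shuffled, with a fixed seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=42)
```

For time series problems you would pass `shuffle=False` and split on time instead, per step 1.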

Now you are ready to train and test your models and iterate over feature engineering and model tuning. In my next article I will share some insights on model selection and tuning.

Conclusion:

Feature Engineering should preserve as much information in its raw form as possible while removing noise, outliers, and irrelevant data like primary keys. As a feature engineer you should NOT introduce bias into the dataset through new derived features or by handling features inconsistently. Have a plan and iterate over it quickly until you find the optimal methods to impute and encode the features.

The goal of this article is to summarize the tasks of feature engineering and help you come up with a plan that can be executed consistently across projects. The information shared here is based on my experience, and I would love to hear others' experiences and observations.

By Anand Peri