One Hot Data Explosion

Context:

Feature engineering is a critical part of the machine learning process and consumes a significant amount of time. Features in ML are analogous to dimensions in a data warehouse, which are likewise the building blocks of a good analytics platform, but there the similarity ends.

One of the key differences between data warehousing/analytics and machine learning is that machine learning algorithms require all data to be presented in numeric format, while data warehouses can handle both string and numeric data.

Features can be broadly classified as Categorical or Non-Categorical based on their characteristics/values. The focus of this article is on Categorical Features.

Categorical Features:

  • Nominal: Feature values have no inherent hierarchy or importance, such as colors, cities, states, etc. You can't map these to numbers directly, because most ML algorithms treat a higher number as more significant.
  • Ordinal: Feature values have an inherent hierarchy and order of importance; education qualifications (High School, Bachelors, Masters, PhD), support ticket priority, etc. These can be converted into numbers using label encoders.
  • Binary: There are only two unique values for these features (yes/no, male/female, etc.), and they can be mapped to 0/1. Very simple to handle.
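The ordinal and binary cases above can be sketched in a few lines of pandas. This is a minimal, hypothetical example; the column names and rank values are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "education": ["High School", "Masters", "Bachelors", "PhD"],
    "churned":   ["yes", "no", "yes", "no"],
})

# Ordinal: map values to ranks that preserve the inherent order
education_rank = {"High School": 0, "Bachelors": 1, "Masters": 2, "PhD": 3}
df["education_enc"] = df["education"].map(education_rank)

# Binary: two unique values map straight to 0/1
df["churned_enc"] = (df["churned"] == "yes").astype(int)
```

The explicit mapping dictionary (rather than an automatic label encoder) guarantees the encoded numbers follow the real-world order of the categories.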

Non-Categorical Features:

  • Continuous: Feature values are continuous; number of years of experience, age, credit score, etc. We can pass these directly into machine learning algorithms.
  • Discrete: Feature values are whole numbers that typically take a limited set of values; number of children, number of dependents, review rating on a scale of 1 to 5, etc. We can pass these directly into machine learning algorithms.
  • Date Time: This is self-explanatory: it contains date, time, and temporal information such as day of the week, day of the month, day of the year, week of the month, week of the year, etc. We can extract these additional features or choose to pass the date-time values directly into machine learning algorithms.
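Extracting the temporal features mentioned above is straightforward with the pandas `.dt` accessor. A short sketch with made-up dates:

```python
import pandas as pd

df = pd.DataFrame({"signup": pd.to_datetime(["2023-01-15", "2023-07-04"])})

# Derive temporal features from the datetime column
df["day_of_week"]  = df["signup"].dt.dayofweek        # Monday=0 .. Sunday=6
df["day_of_month"] = df["signup"].dt.day
df["day_of_year"]  = df["signup"].dt.dayofyear
df["week_of_year"] = df["signup"].dt.isocalendar().week
```

Each derived column is numeric and can be fed to the model directly; the original datetime column can then be dropped if it is no longer needed.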


Illustration:

One of the most popular methods for dealing with Nominal features is One Hot Encoding. There are a few other encodings, but most machine learning materials refer to One Hot Encoding; pandas also provides an implementation of OHE, pd.get_dummies(). Both share the challenge highlighted below, which I call "data explosion".

One Hot Encoding (OHE): OHE converts each unique value of a feature into its own column containing "0" or "1". For example, a column called Color with three values (Red, Blue, Green) becomes three columns: Color_Red, Color_Blue, and Color_Green.

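The Color example above can be reproduced with pd.get_dummies in one call (toy data, column names as described in the text):

```python
import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Red"]})

# pd.get_dummies creates one 0/1 indicator column per unique value
encoded = pd.get_dummies(df, columns=["Color"])
```

One input column with three unique values has become three columns; with hundreds of unique values, the same call would produce hundreds of columns.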

Problem Statement:

Real-world datasets have many Nominal features like City, Occupation, Job Title, Patient Symptoms, etc., and each can have hundreds of unique values.

If you use OHE, data explosion can occur: thousands of new columns blow up your original dataset many times over, and it may no longer fit in memory, causing further problems for the ML algorithms.

Below are a few alternative encoding methods:

  • Binary Encoding: This method also adds columns, but far fewer than OHE. The reason is that binary encoding represents each category as a binary number, so roughly log2(N) columns can cover N unique values (for example, 8 columns can distinguish up to 256 categories).
  • Hash Encoding: This method adds columns based on the hash space allocated. A small hash space causes collisions (distinct categories sharing the same columns), and if you increase the hash space you end up with the data explosion issue again. But overall, a better alternative to OHE.
  • Frequency Encoding: This method does not add any new columns; it replaces each category with its frequency (count or proportion) in the training data. It's a simple method that can work well if the training dataset is representative.
  • Target Encoding: This does not add any new columns to the dataset, so it is memory efficient and helps the algorithm run faster. Also known as mean or probability encoding, it replaces each categorical value with a score derived from its corresponding target (typically the mean of the target for that category). Because it uses the target, it can leak target information and cause the model to overfit.

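Frequency and target encoding, the two methods above that add no columns, can both be sketched with a groupby/map in pandas. The data below is invented for illustration; in practice the mappings must be fitted on the training split only:

```python
import pandas as pd

df = pd.DataFrame({
    "city":    ["NY", "LA", "NY", "SF", "NY", "LA"],
    "churned": [1,    0,    1,    0,    0,    1],
})

# Frequency encoding: replace each category with its share of the rows
freq = df["city"].value_counts(normalize=True)
df["city_freq"] = df["city"].map(freq)

# Target encoding: replace each category with the mean target for
# that category (note: computing this on the full dataset, as done
# here for brevity, is exactly the target leak mentioned above)
target_mean = df.groupby("city")["churned"].mean()
df["city_target"] = df["city"].map(target_mean)
```

Neither encoding widens the frame: each adds a single numeric column regardless of how many unique cities exist.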
Each of these encoders has its own pros and cons. I'm publishing the results observed using some of these encoders on a customer churn analysis against a publicly available Telco dataset on Kaggle with 7,043 observations and 38 features. I split the dataset 70/30 into training and testing; the same conversions, shuffle/splits, and model configuration were used for all the tests on a Kaggle notebook. The training set contained 4,930 observations and the test set contained 2,113 observations.

[Image: benchmark results for the encoders tested]

One Hot Encoder can be tuned by changing the default data type from int64 to int8. Since the data stored in these columns is only 0 or 1, it's safe to assume int8 will work fine, and it cuts memory per column by a factor of eight. Also, do not forget to drop one column after performing one hot encoding.
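Both tweaks are single arguments to pd.get_dummies. A small sketch comparing the default int64 output with an int8, drop-one-column version (toy data, column name assumed):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Blue", "Green"] * 1000})

# Default-width integers: one int64 column per unique value
wide = pd.get_dummies(df, columns=["Color"], dtype=np.int64)

# int8 plus drop_first: same information, a fraction of the memory
small = pd.get_dummies(df, columns=["Color"], dtype=np.int8, drop_first=True)
```

drop_first removes one redundant indicator column (a row with zeros in the remaining columns implies the dropped category), which also avoids perfect multicollinearity in linear models.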

For OHE + PCA, the memory was calculated by adding the OHE memory consumption to the PCA memory consumption, since you must encode the data before applying the PCA transformation.
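The two-step pipeline implied here (encode first, then compress) can be sketched as follows. The feature, cardinality, and component count are assumptions, not the article's actual test configuration:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical high-cardinality nominal feature: 50 city codes, 500 rows
rng = np.random.default_rng(0)
df = pd.DataFrame({"city": rng.integers(0, 50, size=500).astype(str)})

# Step 1: OHE widens the frame to roughly one column per unique value
wide = pd.get_dummies(df, columns=["city"], dtype=np.int8)

# Step 2: PCA compresses the wide 0/1 matrix down to a few components
pca = PCA(n_components=10)
compact = pca.fit_transform(wide)
```

The PCA output is dense floats, so the overall memory cost is the encoded frame plus the compressed one, which is why the two were summed in the measurement above.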

Detailed screenshots for the above tests:


I'm researching how to improve the performance of ML models through feature engineering, and encoding is one of its primary components. I have built a new encoding technique and have been testing it on various datasets. I would love to connect with others in the community doing research in this area; please reach out via LinkedIn.
