One Hot Data Explosion
Context:
Feature Engineering is a critical part of the Machine Learning process and consumes a significant amount of time. Features in ML are analogous to dimensions in a data warehouse, which are likewise the critical building blocks of a good analytics platform, but the similarity ends there.
One of the key differences between data warehousing/analytics and machine learning is that ML algorithms require all data to be presented in numeric format, while data warehouses can handle both string and numeric data.
Features can be broadly classified as Categorical or Non-Categorical based on their characteristics and values. The focus of this article is on Categorical Features.
Categorical Features: features that take one of a limited set of discrete values, such as City, Color, or Occupation. Nominal categorical features have no inherent order; ordinal ones do (e.g., Small/Medium/Large).
Non-Categorical Features: continuous numeric features, such as age, tenure, or monthly charges, that do not represent discrete categories.
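As a rough first pass, the two groups can be separated in pandas by column dtype. This is a heuristic sketch, not a rule: numeric-coded categories such as ZIP codes will be misclassified and need manual handling. The column names below are illustrative, not from the Telco dataset.

```python
import pandas as pd

# Toy frame mixing categorical and non-categorical columns
df = pd.DataFrame({
    "City": ["NYC", "LA", "NYC"],          # nominal (categorical)
    "Tenure": [12, 5, 30],                 # numeric (non-categorical)
    "MonthlyCharges": [70.5, 29.9, 99.0],  # numeric (non-categorical)
})

# Object/category dtypes -> treat as categorical; numeric dtypes -> non-categorical
categorical_cols = df.select_dtypes(include=["object", "category"]).columns.tolist()
numeric_cols = df.select_dtypes(include="number").columns.tolist()

print(categorical_cols)  # ['City']
print(numeric_cols)      # ['Tenure', 'MonthlyCharges']
```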
Illustration:
One of the most popular methods of dealing with Nominal Features is One Hot Encoding. There are a few other encodings, but most machine learning materials refer to One Hot Encoding; pandas also provides an implementation of OHE via pd.get_dummies(). Both share the challenge highlighted below, which I'm calling "data explosion".
One Hot Encoding (OHE): OHE converts each unique value of a feature into its own column containing "0" or "1". For example, a column called Color with three values (Red, Blue, Green) becomes three columns. Below is a visual example.
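The same Color example can be reproduced in a couple of lines with pandas (a minimal sketch; the dtype argument is explained further below):

```python
import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Red"]})

# One new column per unique value, filled with 0/1
encoded = pd.get_dummies(df, columns=["Color"], dtype=int)
print(encoded)
#    Color_Blue  Color_Green  Color_Red
# 0           0            0          1
# 1           1            0          0
# 2           0            1          0
# 3           0            0          1
```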
Problem Statement:
Real-world datasets have many Nominal features, such as City, Occupation, Job Title, and Patient Symptoms, and each can have hundreds of unique values.
If you use OHE, data explosion can occur: thousands of new columns blow up your original dataset many-fold, and it may no longer fit in memory, causing further problems for the ML algorithms.
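A back-of-the-envelope sketch makes the blow-up concrete (the row and cardinality counts below are illustrative, not from the Telco experiment): one nominal feature with 500 unique values, encoded at pandas' default int64 width, already costs tens of megabytes at modest row counts.

```python
import numpy as np
import pandas as pd

n_rows, n_unique = 10_000, 500

# One nominal feature (think "City") with 500 distinct values
df = pd.DataFrame({"City": np.tile(np.arange(n_unique), n_rows // n_unique).astype(str)})

encoded = pd.get_dummies(df, dtype="int64")
print(encoded.shape)  # (10000, 500): one column per unique value

# 10,000 rows x 500 columns x 8 bytes per int64 = 40,000,000 bytes (~40 MB)
mb = encoded.memory_usage(deep=True, index=False).sum() / 1e6
print(round(mb))  # 40
```

Scale that to hundreds of thousands of rows and several high-cardinality features and the encoded frame easily outgrows RAM.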
Below are a few alternative encoding methods:
Each of these encoders has its own pros and cons. I'm publishing the results observed with some of these encoders on a customer churn analysis against the publicly available Telco dataset on Kaggle, which has 7,043 observations and 38 features. I split the dataset 70/30 for training and testing. The same conversions, shuffle/splits, and model configuration were used for all tests, run on a Kaggle notebook. The training set contained 4,930 observations and the test set 2,113.
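The 70/30 split with an identical shuffle across tests can be reproduced with scikit-learn's train_test_split and a fixed random_state (a sketch of the setup; the article does not name the exact tool used, and the stand-in frame below replaces the real Telco data):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Dummy frame standing in for the 7,043-row Telco churn dataset
df = pd.DataFrame({"row_id": range(7043)})

# Fixed random_state so every encoder test sees the same shuffle/split
train, test = train_test_split(df, test_size=0.3, random_state=42)

print(len(train), len(test))  # 4930 2113
```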
One Hot Encoder can be tuned by changing the default data type from Int64 to Int8. Since the data stored in these columns is only 0 or 1, it is safe to assume Int8 will work fine, cutting memory to one-eighth. Also, do not forget to drop one column after performing one hot encoding, as its value is implied by the remaining columns.
For OHE + PCA, memory was calculated by adding the OHE memory consumption to the PCA memory consumption, since you must encode the data before applying the PCA transformation.
Detailed screenshots for the above tests:
I'm researching how to improve the performance of ML models through feature engineering, and one of its primary components is encoding. I have built a new encoding technique and have been testing it on various datasets. I would love to connect with others in the community doing research in this area; please reach out to me via LinkedIn.