One Hot Data Explosion
Context:
Feature Engineering is a critical part of the Machine Learning process and consumes a significant amount of time. Features in ML are analogous to dimensions in a data warehouse, which are likewise the critical building blocks of a good analytics platform, but the similarity ends there.
One of the key differences between data warehousing/analytics and machine learning is that ML algorithms require all data to be presented in numeric format, while data warehouses can handle both string and numeric data.
Features can be broadly classified as Categorical or Non-Categorical based on their characteristics and values. The focus of this article is on Categorical Features.
Categorical Features: features that take one of a limited set of discrete values, such as City, Color, or Occupation. Nominal categorical features have no inherent order; ordinal ones do (e.g., Small/Medium/Large).
Non-Categorical Features: continuous numeric features, such as age, tenure, or monthly charges, that do not represent discrete categories.
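As a rough first pass, the two groups can be separated in pandas by column dtype. This is a heuristic sketch, not a rule: numeric-coded categories such as ZIP codes will be misclassified and need manual handling. The column names below are illustrative, not from the Telco dataset.

```python
import pandas as pd

# Toy frame mixing categorical and non-categorical columns
df = pd.DataFrame({
    "City": ["NYC", "LA", "NYC"],          # nominal (categorical)
    "Tenure": [12, 5, 30],                 # numeric (non-categorical)
    "MonthlyCharges": [70.5, 29.9, 99.0],  # numeric (non-categorical)
})

# Object/category dtypes -> treat as categorical; numeric dtypes -> non-categorical
categorical_cols = df.select_dtypes(include=["object", "category"]).columns.tolist()
numeric_cols = df.select_dtypes(include="number").columns.tolist()

print(categorical_cols)  # ['City']
print(numeric_cols)      # ['Tenure', 'MonthlyCharges']
```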
Illustration:
One of the most popular methods of dealing with Nominal Features is One Hot Encoding. There are a few other encodings, but most machine learning materials refer to One Hot Encoding; pandas also provides an implementation of OHE via pd.get_dummies(). Both share the challenge highlighted below, which I'm calling "data explosion".
One Hot Encoding (OHE): OHE converts each unique value of a feature into its own column containing "0" or "1". For example, a column called Color with three values (Red, Blue, Green) becomes three columns. Below is a visual example.
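The same Color example can be reproduced in a couple of lines with pandas (a minimal sketch; the dtype argument is explained further below):

```python
import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Red"]})

# One new column per unique value, filled with 0/1
encoded = pd.get_dummies(df, columns=["Color"], dtype=int)
print(encoded)
#    Color_Blue  Color_Green  Color_Red
# 0           0            0          1
# 1           1            0          0
# 2           0            1          0
# 3           0            0          1
```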
Problem Statement:
Real-world datasets have many Nominal features, such as City, Occupation, Job Title, and Patient Symptoms, and each can have hundreds of unique values.
If you use OHE, data explosion can occur: thousands of new columns blow up your original dataset many-fold, and it may no longer fit in memory, causing further problems for the ML algorithms.
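A back-of-the-envelope sketch makes the blow-up concrete (the row and cardinality counts below are illustrative, not from the Telco experiment): one nominal feature with 500 unique values, encoded at pandas' default int64 width, already costs tens of megabytes at modest row counts.

```python
import numpy as np
import pandas as pd

n_rows, n_unique = 10_000, 500

# One nominal feature (think "City") with 500 distinct values
df = pd.DataFrame({"City": np.tile(np.arange(n_unique), n_rows // n_unique).astype(str)})

encoded = pd.get_dummies(df, dtype="int64")
print(encoded.shape)  # (10000, 500): one column per unique value

# 10,000 rows x 500 columns x 8 bytes per int64 = 40,000,000 bytes (~40 MB)
mb = encoded.memory_usage(deep=True, index=False).sum() / 1e6
print(round(mb))  # 40
```

Scale that to hundreds of thousands of rows and several high-cardinality features and the encoded frame easily outgrows RAM.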
Below are a few alternative encoding methods:
Each of these encoders has its own pros and cons. I'm publishing the results observed with some of these encoders on a customer churn analysis against the publicly available Telco dataset on Kaggle, which has 7,043 observations and 38 features. I split the dataset 70/30 for training and testing. The same conversions, shuffle/splits, and model configuration were used for all tests, run on a Kaggle notebook. The training set contained 4,930 observations and the test set 2,113.
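The 70/30 split with an identical shuffle across tests can be reproduced with scikit-learn's train_test_split and a fixed random_state (a sketch of the setup; the article does not name the exact tool used, and the stand-in frame below replaces the real Telco data):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Dummy frame standing in for the 7,043-row Telco churn dataset
df = pd.DataFrame({"row_id": range(7043)})

# Fixed random_state so every encoder test sees the same shuffle/split
train, test = train_test_split(df, test_size=0.3, random_state=42)

print(len(train), len(test))  # 4930 2113
```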
One Hot Encoder can be tuned by changing the default data type from Int64 to Int8. Since the data stored in these columns is only 0 or 1, it is safe to assume Int8 will work fine, cutting memory to one-eighth. Also, do not forget to drop one column after performing one hot encoding, as its value is implied by the remaining columns.
For OHE + PCA, memory was calculated by adding the OHE memory consumption to the PCA memory consumption, since you must encode the data before applying the PCA transformation.
Detailed screenshots for the above tests:
I'm researching how to improve the performance of ML models through feature engineering, and one of its primary components is encoding. I have built a new encoding technique and have been testing it on various datasets. I would love to connect with others in the community doing research in this area; please reach out to me via LinkedIn.