Why use one-hot encoding for training a classifier?

One-hot encoding is often used when training supervised classifiers, and in this article I want to explain the motivation behind this design.

First, let's talk about one-hot encoding:

One-hot encoding is a method for labeling which class a data point belongs to: we assign 0 to every dimension except for a 1 in the dimension of the class the data point belongs to.

For example, suppose there are 3 classes {1: Male, 2: Female, 3: Not Specified} and five instances (ID 1~5), as in the table from the image below. To transform the labels (or, you could say, the data) to one-hot encoding, we transform ID 1's data from {Male} to {1,0,0}, ID 2's from {Female} to {0,1,0}, and ID 3's from {Not Specified} to {0,0,1}.

(Image source: http://brettromero.com/wordpress/wp-content/uploads/2016/03/OHE-Image.png)
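As a concrete illustration, here is a minimal Python sketch of that transformation. The labels for IDs 4 and 5 are assumed for the example, since the original table is only shown in the image:

```python
# Minimal sketch of one-hot encoding, using plain Python.
classes = ["Male", "Female", "Not Specified"]

def one_hot(label, classes):
    """Return a vector with 1 at the label's index and 0 elsewhere."""
    return [1 if c == label else 0 for c in classes]

labels = ["Male", "Female", "Not Specified", "Male", "Female"]  # IDs 1~5 (4 and 5 assumed)
encoded = [one_hot(lbl, classes) for lbl in labels]

print(encoded[0])  # ID 1: [1, 0, 0]
print(encoded[1])  # ID 2: [0, 1, 0]
print(encoded[2])  # ID 3: [0, 0, 1]
```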

My question is: why can't we just use numerical values, like Male is 1 and Female is 2, for training a classifier?

To explain this, we need to understand the process of designing a supervised machine learning model, in which the model is updated according to a measure of its goodness.

  • Step 1: design the feedforward model function, such as a linear model, a logistic model, a neural network, and so on. Let's say the model is y = f(x) = Σ wᵢxᵢ + b.
  • Step 2: choose a loss function to evaluate the goodness of the model. Many methods could be used, such as hinge loss, logistic loss, or square loss. The general idea of a loss function is to measure how far the predicted answer is from the true answer. Let's use square loss: Σ (predicted answer - true answer)^2.
  • Step 3: if the loss function has a closed-form solution, solve it directly. If not, we can compute the derivative of the loss function and update the parameters by gradient descent. (We can still use gradient descent for a model that has a closed-form solution; a sketch follows this list.)
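To make the three steps concrete, here is a minimal NumPy sketch of a linear model trained with square loss by gradient descent. The toy data, learning rate, and step count are assumptions for illustration, not from the article:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))        # 5 instances, 4 features (toy data)
Y = np.eye(3)[[0, 1, 2, 0, 1]]     # one-hot targets for 3 classes

W = np.zeros((4, 3))               # model parameters
b = np.zeros(3)
lr = 0.01                          # learning rate (an assumption)

for step in range(500):
    pred = X @ W + b               # Step 1: feedforward model y = f(x)
    diff = pred - Y
    loss = (diff ** 2).sum()       # Step 2: square loss
    # Step 3: exact gradients of the summed square loss, then a
    # gradient-descent update of the parameters
    W -= lr * 2 * (X.T @ diff)
    b -= lr * 2 * diff.sum(axis=0)

print(f"final loss: {loss:.4f}")
```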

Given step 2 and one-hot encoding, we sum the squared difference over every dimension. In this example, there are three dimensions: Male, Female, and Not Specified. If we predict Male and the true answer is Female, the loss is ((1,0,0)-(0,1,0))^2 = 2. If we predict Male and the true answer is Not Specified, the loss is the same: ((1,0,0)-(0,0,1))^2 = 2.
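Here is a quick NumPy check of those two losses:

```python
import numpy as np

# With one-hot targets and square loss, every wrong prediction costs the same.
male          = np.array([1, 0, 0])
female        = np.array([0, 1, 0])
not_specified = np.array([0, 0, 1])

print(((male - female) ** 2).sum())         # predict Male, truth Female: 2
print(((male - not_specified) ** 2).sum())  # predict Male, truth Not Specified: 2
```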

However, if we say Male is 1, Female is 2, and Not Specified is 3, the squared losses for the two cases above become (1-2)^2 = 1 and (1-3)^2 = 4, assigning different losses to the two kinds of mistake. This tells the model that class 1 is closer to class 2 than it is to class 3.

If we have 10 classes like {1: bear, 2: bee, 3: dog, ..., 10: cow}, using numerical values as labels means the distance between bear and bee is much smaller than the distance between bear and cow. Under square loss, misclassifying bear as bee costs (1-2)^2 = 1, while misclassifying bear as cow costs (1-10)^2 = 81, so the first mistake is penalized only 1/81 as much as the second.
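And the corresponding check for integer labels:

```python
# With integer labels, the square loss depends on the arbitrary numbering:
# misclassifying bear (1) as bee (2) costs far less than bear (1) as cow (10).
print((1 - 2) ** 2)   # bear -> bee:  1
print((1 - 10) ** 2)  # bear -> cow: 81
```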

Usually, a classification task doesn't need this weird assumption; we just want the model to predict correctly and to accumulate loss uniformly over every wrong prediction.

Oh, thank you! Someone finally explained the why. I can always find a vague explanation about it being better, or that it makes weird assumptions about the data, but this is the first article I've found that says why that assumption occurs. Thanks!

Nice one! I recently encountered a problem: there are too many classes in a categorical feature for one-hot encoding. My solutions are as follows: 1. use other properties to represent the attribute (e.g., if there are 10k kinds of products, we use price, sale-start-date, and average-sale-amount to represent this categorical feature); 2. label encoding, with a binary representation of the label (2^14 > 10k, so we need at least 14 columns to represent this categorical feature). I will usually use the first approach, but our mentor at Google suggested the second one. I still feel doubtful about the second :(

Such a good explanation 😬 wow wow
