Why use one-hot encoding for training a classifier?

One-hot encoding is often used when training supervised classifiers, and in this article I want to explain the motivation behind this design.

First, let's talk about one-hot encoding:

One-hot encoding is a method for labeling which class a data point belongs to: we assign 0 to every dimension except for a 1 in the dimension of the class the data point belongs to.

For example, suppose there are 3 classes {1: Male, 2: Female, 3: Not Specified} and five instances (ID 1~5), as in the table from the image below. To transform the labels (or, you could say, the data) to one-hot encoding, we transform ID 1's data from {Male} to {1,0,0}, ID 2's from {Female} to {0,1,0}, and ID 3's from {Not Specified} to {0,0,1}.

(Image source: http://brettromero.com/wordpress/wp-content/uploads/2016/03/OHE-Image.png)
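As a concrete illustration, here is a minimal Python sketch of that transformation. The labels for IDs 4 and 5 are assumed for the example, since the original table is only shown in the image:

```python
# Minimal sketch of one-hot encoding, using plain Python.
classes = ["Male", "Female", "Not Specified"]

def one_hot(label, classes):
    """Return a vector with 1 at the label's index and 0 elsewhere."""
    return [1 if c == label else 0 for c in classes]

labels = ["Male", "Female", "Not Specified", "Male", "Female"]  # IDs 1~5 (4 and 5 assumed)
encoded = [one_hot(lbl, classes) for lbl in labels]

print(encoded[0])  # ID 1: [1, 0, 0]
print(encoded[1])  # ID 2: [0, 1, 0]
print(encoded[2])  # ID 3: [0, 0, 1]
```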

My question is: why can't we just use numerical values, like Male is 1 and Female is 2, for training a classifier?

To explain this, we need to understand the process of designing a supervised machine learning model, in which the model is updated according to a measure of its goodness.

  • Step 1: design the feedforward model function, such as a linear model, a logistic model, a neural network, and so on. Let's say the model is y = f(x) = Σ wᵢxᵢ + b.
  • Step 2: choose a loss function to evaluate the goodness of the model. Many methods could be used, such as hinge loss, logistic loss, or square loss. The general idea of a loss function is to measure how far the predicted answer is from the true answer. Let's use square loss: Σ (predicted answer - true answer)^2.
  • Step 3: if the loss function has a closed-form solution, solve it directly. If not, we can compute the derivative of the loss function and update the parameters by gradient descent. (We can still use gradient descent for a model that has a closed-form solution; a sketch follows this list.)
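To make the three steps concrete, here is a minimal NumPy sketch of a linear model trained with square loss by gradient descent. The toy data, learning rate, and step count are assumptions for illustration, not from the article:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))        # 5 instances, 4 features (toy data)
Y = np.eye(3)[[0, 1, 2, 0, 1]]     # one-hot targets for 3 classes

W = np.zeros((4, 3))               # model parameters
b = np.zeros(3)
lr = 0.01                          # learning rate (an assumption)

for step in range(500):
    pred = X @ W + b               # Step 1: feedforward model y = f(x)
    diff = pred - Y
    loss = (diff ** 2).sum()       # Step 2: square loss
    # Step 3: exact gradients of the summed square loss, then a
    # gradient-descent update of the parameters
    W -= lr * 2 * (X.T @ diff)
    b -= lr * 2 * diff.sum(axis=0)

print(f"final loss: {loss:.4f}")
```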

Given step 2 and one-hot encoding, we sum the squared difference over every dimension. In this example, there are three dimensions: Male, Female, and Not Specified. If we predict Male and the true answer is Female, the loss is ((1,0,0)-(0,1,0))^2 = 2. If we predict Male and the true answer is Not Specified, the loss is the same: ((1,0,0)-(0,0,1))^2 = 2.
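Here is a quick NumPy check of those two losses:

```python
import numpy as np

# With one-hot targets and square loss, every wrong prediction costs the same.
male          = np.array([1, 0, 0])
female        = np.array([0, 1, 0])
not_specified = np.array([0, 0, 1])

print(((male - female) ** 2).sum())         # predict Male, truth Female: 2
print(((male - not_specified) ** 2).sum())  # predict Male, truth Not Specified: 2
```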

However, if we say Male is 1, Female is 2, and Not Specified is 3, the squared losses for the two cases above become (1-2)^2 = 1 and (1-3)^2 = 4, assigning different losses to the two kinds of mistake. This tells the model that class 1 is closer to class 2 than it is to class 3.

If we have 10 classes like {1: bear, 2: bee, 3: dog, ..., 10: cow}, using numerical values as labels means the distance between bear and bee is much smaller than the distance between bear and cow. Under square loss, misclassifying bear as bee costs (1-2)^2 = 1, while misclassifying bear as cow costs (1-10)^2 = 81, so the first mistake is penalized only 1/81 as much as the second.
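And the corresponding check for integer labels:

```python
# With integer labels, the square loss depends on the arbitrary numbering:
# misclassifying bear (1) as bee (2) costs far less than bear (1) as cow (10).
print((1 - 2) ** 2)   # bear -> bee:  1
print((1 - 10) ** 2)  # bear -> cow: 81
```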

Usually, a classification task doesn't need this weird assumption; we just want the model to predict correctly and to accumulate loss uniformly over every wrong prediction.

Oh, thank you! Someone finally explained the why. I can always find a vague explanation about it being better, or that it makes weird assumptions about the data, but this is the first article I've found that says why that assumption occurs. Thanks!

Nice one! I recently encountered a problem: there are too many classes in a categorical feature for one-hot encoding. My solutions are as follows: 1. use other properties to represent the attribute (e.g., if there are 10k kinds of products, we use price, sale-start-date, and average-sale-amount to represent this categorical feature); 2. label encoding, with a binary representation of the label (2^14 > 10k, so we need at least 14 columns to represent this categorical feature). I will usually use the first approach, but our mentor at Google suggested the second one. I still feel doubtful about the second :(

Such a good explanation 😬 wow wow
