Deriving Cross-Entropy Function for Logistic Regression

Anybody who has read about or implemented Logistic Regression knows the cost function that needs to be optimised to get the best possible estimate of the parameters/weights of the model. In this article I discuss the mathematical intuition behind this cross-entropy function.

Credits for the above pic : https://sebastianraschka.com

I am assuming that the reader has a basic knowledge of Regression.

Note: In most of the places in this article where I need to write equations or mathematical expressions, I used my Jupyter notebook, so the font there is smaller; I request you to zoom in a couple of times before you read further, which will ensure a better reading experience.

Below are a few mathematical building blocks needed before I discuss the Cost Function :

  1. Bernoulli Distribution
  2. Binomial Distribution
  3. Entropy in its generic form -
  • I will only introduce the basic concept and reserve the detailed, Information Theory based discussion for another day. Since the topic is about Cross-Entropy, I will introduce it only loosely.

OK, let's dive in...

A Bernoulli distribution is a discrete probability distribution of a random variable which takes the value 1 with probability p and the value 0 with probability 1-p. Note that the random variable can only take two values here, 1 or 0. So an instance of a training sample can have an outcome of 1 or 0 - the patient is cancer prone or is not cancer prone, the email is junk or the email is not junk, a coin toss gives heads or it does not, and so on. One such instance is termed a Bernoulli trial or experiment.

Mathematically it is represented as :

if X is the random variable with such a distribution, then : P(X = 1) = p = (1 - P(X = 0))

For discrete random variables such a distribution function is termed a Probability Mass Function. If the variable is continuous (the mass of a mouse or the height of a person) then it is termed a Probability Density Function.

So the probability mass function f of this distribution, over the possible outcomes k, is :

f(k; p) = p^k * (1 - p)^(1 - k), for k in {0, 1} -----> CheckPoint (1)

The above equation might look clumsy but if we pause and take an example it is very easy to understand :

For a few minutes let's forget everything and just focus on the problem below :

...getting on to my Jupyter notebook, so that I can explain things better :
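
In case the notebook screenshot is hard to read, here is a small stand-in for that scenario in Python. The specific numbers (3 successes out of 5 trials with p = 0.5) are only an illustrative assumption, not the exact values from the notebook:

    from math import comb

    # Binomial PMF: probability of exactly k successes in n Bernoulli trials,
    # each succeeding with probability p.
    def binomial_pmf(k, n, p):
        return comb(n, k) * (p ** k) * ((1 - p) ** (n - k))

    # Example: probability of exactly 3 heads in 5 tosses of a fair coin.
    n, k, p = 5, 3, 0.5
    print(binomial_pmf(k, n, p))   # 10 * 0.5**3 * 0.5**2 = 0.3125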

Pause here and study the above scenario by substituting the values of the total number of trials and the number of successes. If one uses basic intuition, the subtractions in the exponents become very clear and we never have to remember the equation - all you need is very basic mathematics.

...getting back to my Jupyter notebook to show the equations better :

There are a lot of articles on the Internet to read more about these distributions, so I am limiting the explanation to the extent we need here.

Entropy & Cross-Entropy :

In its most generic form, in a given system, measuring Entropy is a way of representing the uncertainty of an outcome. Let's again go back to tossing a coin :

  • an unbiased coin, Pr(X=H) = 1/2, Pr(X=T) = 1/2
  • a biased coin, say, Pr(X=H) = 3/4, Pr(X=T) = 1/4 

In which of the above two scenarios do you think predictions come more easily? I am sure it will be with the biased coin, because the probability of getting Heads in that scenario is 0.75, which is higher than the 0.5 of the unbiased scenario.

This essentially tells us that in the biased coin scenario the uncertainty is less. In the unbiased coin scenario there is more uncertainty, or randomness, or Entropy.

The more the Entropy, the more the uncertainty and the harder it is to predict. The less the Entropy, the less the uncertainty and the easier it is to predict.

Given this, suppose there is a biased die with probabilities : Pr(X=1) = 1/4, Pr(X=2) = 1/6, Pr(X=3) = 1/12, Pr(X=4) = 1/12, Pr(X=5) = 1/4, Pr(X=6) = 1/6. Now, if we were to guess the outcome when someone rolls this die, the best first guess would be either 1 or 5, as those are the most probable outcomes. As one keeps rolling, our guesses also fall in line with the descending order of the respective probabilities. So our informational guess is weighted by the probability of a given outcome. It would be a poor strategy to predict 3 when we roll a die following the above probability distribution, as it is one of the least probable outcomes. Our informational guess has to be biased by the probability, and that seems to be the better strategy for prediction.
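
Formally, Entropy is the information content of each outcome, log2(1/p), weighted by the probability of that outcome and summed over all outcomes :

H(X) = Σ p(x) * log2(1 / p(x)) = - Σ p(x) * log2(p(x)), summing over all possible outcomes x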

As discussed, this is a weighted average: if we have N equally likely outcomes, each probability is 1/N, and when we take the average the N in the numerator and the one in the denominator cancel, leaving us with the above equation. It is the number of information-encoding bits for an outcome, weighted by the probability of that outcome and summed over all outcomes, that gives the actual uncertainty - the average number of bits needed to encode - which expresses the overall randomness or Entropy. The more the Entropy, the more bits are needed; the less the Entropy, the fewer bits are needed to express or encode the "game".
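
As a quick sanity check, here is a small Python sketch (an illustrative example, not from the original notebook) that computes the Entropy of the fair coin, the biased coin and the biased die from the examples above:

    from math import log2

    def entropy(probs):
        # Average number of bits needed to encode an outcome:
        # information content log2(1/p) weighted by the probability p.
        return sum(p * log2(1 / p) for p in probs if p > 0)

    print(entropy([0.5, 0.5]))                           # fair coin: 1.0 bit
    print(entropy([0.75, 0.25]))                         # biased coin: ~0.811 bits
    print(entropy([1/4, 1/6, 1/12, 1/12, 1/4, 1/6]))     # biased die: ~2.459 bits
    print(entropy([1/6] * 6))                            # fair die: ~2.585 bits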

Now, if we use the strategy built for the biased die on an unbiased die, it is obviously not favourable (as the outcomes follow different distributions). So, for a given game, the guessing strategy must differ based on the probabilities of the various outcomes: the minimum number of questions to ask to find out the outcome depends fully on the known probabilities of the various possible outcomes.

Coming back to the Machine Learning scenario, we have labelled samples in a dataset drawn from a population. We do not know the true probability distribution P(X) of the population; we only start with the assumption that the samples are independent and identically distributed (iid). Because we don't know the true P(X), we start the guessing game by forming a function Q(X) with various coefficients/weights that can be tweaked so that it approximates the unknown true P(X). The strategy of the game is to guess the right coefficients/weights; since we have labelled data, we know the true outcome of each drawn sample instance. What we do is keep tweaking the coefficients/weights of Q(X) until we converge. Each time we do this, we alter our strategy and compare it with the actual result (from the labelled dataset). From the coin game above, we know that our guess must always be weighted by the probability. So the actual strategy is Q(X), and we know the true labelled outcome to compare it against.

So the uncertainty that we need to minimise by altering the coefficients of Q(X) is :
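
H(P, Q) = - Σ P(x) * log(Q(x)), summing over all possible outcomes x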

This is closely tied to the KL divergence of Q from P (the relative entropy of P with respect to Q): the Cross-Entropy equals the entropy of P plus that KL divergence, and since the entropy of P is fixed, minimising one minimises the other. So we minimise this Cross-Entropy function in an endeavour to guess the Q distribution closest to what the population P follows. This equation shows up many times as we study Machine Learning. I have used a few words loosely here to make things simpler, but an in-depth explanation requires a completely new article. I hope this intuition behind Cross-Entropy is sufficient before we start working with it. Note that this is also called the Log-loss function, as it estimates the loss of information (bits) incurred as we approximate the true distribution with the predicted one.
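
To make this concrete, here is a tiny Python sketch (an illustrative example, not from the original notebook) showing that the Cross-Entropy of a fixed P against a guessed Q is smallest when Q matches P :

    from math import log2

    def cross_entropy(p, q):
        # Expected number of bits when outcomes truly follow P,
        # but we encode/guess using the distribution Q.
        return sum(pi * log2(1 / qi) for pi, qi in zip(p, q) if pi > 0)

    p = [0.75, 0.25]                                  # true (biased coin) distribution
    for q in ([0.5, 0.5], [0.6, 0.4], [0.75, 0.25], [0.9, 0.1]):
        print(q, round(cross_entropy(p, q), 4))

    # The minimum (~0.8113 bits, the entropy of P itself) occurs at q = p;
    # any other Q costs extra bits, and that extra cost is the KL divergence.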

Cost Function for Logistic Regression :

There is a lot of information on the Internet about what Logistic Regression is and when to use it, so it would be repetitive to discuss that here. I would now like to discuss the statistical intuition behind the Cost function for Logistic Regression.
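
Since the derivation itself lives in the notebook screenshots, here is a minimal sketch of where it lands, assuming the usual sigmoid hypothesis h(x) = 1 / (1 + e^(-w·x)) and labels y in {0, 1}: each prediction is treated as a Bernoulli probability, and averaging the Cross-Entropy between the true labels and the predicted probabilities gives the familiar log-loss cost that Logistic Regression minimises:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def log_loss(w, X, y):
        # Binary cross-entropy cost: the average of -[y*log(h) + (1-y)*log(1-h)]
        # over the labelled samples, where h = sigmoid(X @ w) is the predicted
        # Bernoulli probability of the positive class.
        h = sigmoid(X @ w)
        eps = 1e-12                                   # avoid log(0)
        return -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))

    # Tiny made-up example: a bias column plus one feature.
    X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])
    y = np.array([0, 0, 1, 1])
    print(log_loss(np.array([0.0, 0.0]), X, y))       # ~0.693 (= ln 2), an uninformed guess
    print(log_loss(np.array([-4.0, 2.0]), X, y))      # ~0.18, better weights give a lower cost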

Hope this helps in understanding the Cross-Entropy function for Logistic Regression and where it comes from. 

Please let me know if you have any comments or questions.
