LOGISTIC REGRESSION CLASSIFIER
How It Works (Part-1)
A Step-by-Step Complete Guide (Conceptual)
Logistic Regression is a ‘Statistical Learning’ technique categorized under ‘Supervised’ Machine Learning (ML) methods dedicated to ‘Classification’ tasks. It has gained a tremendous reputation over the last two decades, especially in the financial sector, due to its prominent ability to detect defaulters. A contradiction seems to appear when we declare that a classifier whose name contains the term ‘Regression’ is being used for classification, but this is what makes Logistic Regression magical: it uses a linear regression equation to produce discrete binary outputs (Figure-1). And yes, it is also categorized in the ‘Discriminative Models’ subgroup[1] of ML methods, like Support Vector Machines and the Perceptron, all of which use linear equations as a building block and attempt to maximize the quality of the output on a training set.
We will follow the guide below throughout the article in the given order. As the content suggests, this article is a conceptual manual intended to clarify the technical workflow of the Logistic Regression Classifier. After a long period of searching and reading, I realized that there is an abundance of empirical studies but a striking scarcity of material on the theoretical aspects of Machine Learning implementations. This is perhaps why Yuval Noah Harari states in his best-selling book ‘Homo Deus: A Brief History of Tomorrow’ that:
“…In fact modernity is a surprisingly simple deal. The entire contract can be summarized in a single phrase: humans agree to give up ‘meaning’ in exchange for ‘power’…”
With this article series, my aim is to create a complete guide that provides the inner meaning of each step in the Logistic Regression workflow.
This post, as the first part of the ‘Logistic Regression’ article series, covers the content above up to the ‘Optimizing Objectives’ section. The remaining headlines will be covered in upcoming posts. Hope you enjoy…
A. Data Structure
Inputs xᵢⱼ are continuous feature vectors (xᵢ’s) of length K, where j=1,…,K and i=1,…,N. So, the input is a matrix X which contains N inputs (data points), each with K features. The inputs can be illustrated as a matrix X like below.
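(A plain-text reconstruction of that illustration from the definitions above, with the N data points as rows and the K features as columns:)

X = ⎡ x₁₁ x₁₂ … x₁ₖ ⎤
    ⎢ x₂₁ x₂₂ … x₂ₖ ⎥
    ⎢  ⋮    ⋮   ⋱   ⋮  ⎥
    ⎣ xₙ₁ xₙ₂ … xₙₖ ⎦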
And the output yᵢ is a discrete, binary variable, such that yᵢ ϵ {0,1}.
B. Experiment Design
Let’s say we have a ‘flipping/tossing a coin’ experiment. Supposing the coin is a fair one gives us ‘equally likely’ outcomes of ‘Head’ and ‘Tail’. That is, the ‘posterior’ probabilities are:
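(Reconstructing the equation from the description, since the original image is not reproduced here:)

P(y = Head | X) = P(y = Tail | X) = 0.5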
where X is an input matrix that contains all trials/observations and their features. Since this ‘flipping coin’ experiment does not include any independent variable (feature), our input matrix X includes only the trials we made; that is, it will be a vector of size ‘n×1’, where x₁ just symbolizes the first trial rather than a concrete input.
But if we replace the experiment with a ‘Credit Scoring’ one, our outcome universe will still be discrete and binary (‘Default’ and ‘Not’); however, the input vector turns back into a matrix, since there are now features, as shown above!
Another radical change awaiting us after shifting the experiment is the ‘uncertainty’ affecting the fairness that we assume for coins. Like unfair coins, credits host different chances of being defaulted due to the different characteristics of obligors. So, our ‘posteriors’ will not be ‘equally likely’ anymore.
C. Decision/Activation Function
The unfairness described in the previous part brings the problem of uncertainty into the process and the necessity of anticipation. As a ‘Supervised-Classification’ method, Logistic Regression helps us approximate those ‘uncertain’ posteriors with a differentiable[2] ‘decision function’, drawn in Figure-2 below.
This function is called the ‘logistic function’ or ‘sigmoid function’ and helps us squash real-valued continuous inputs into the range (0,1), which is gloriously useful while dealing with probabilities! With the help of the ‘logistic function’, we can write our posteriors like below.
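(A reconstruction of the posteriors referenced here, consistent with the P(yᵢ|xᵢ) = 1/(1+e⁻ᶠ) form used later in Part D:)

P(y=1 | x) = 1 / (1 + e^(−f(x)))
P(y=0 | x) = 1 − P(y=1 | x) = 1 / (1 + e^(f(x)))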
where f(x) is a function consisting of our features (xⱼ) and their corresponding weights/coefficients (βⱼ) in the linear form shown below.
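(Reconstructed from the definitions that follow; an explicit intercept β₀, if present in the original figure, can be absorbed by adding a constant feature:)

f(x) = β₁x₁ + β₂x₂ + … + βₖxₖ + ε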
where x, β ϵ Rᴷ, f(x) ϵ R, and ε represents the ‘random error process’ (noise)[3] inevitably occurring in the data generating process.
By using the posterior equation above, we can rewrite the estimation function f(x) in terms of the ‘posterior probability’, as shown below.
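(Reconstructed by inverting the sigmoid posterior above, writing p = P(y=1|x):)

log( p / (1 − p) ) = f(x)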
which is famously known as the ‘log odds’ ratio[4]. One can realize its usefulness while trying to interpret[5] the coefficients of the linear regression function f(x).
Using a ‘logarithmic’ transformation helps our learning mechanism in three main ways:
1- It makes the values more ‘normalized’ (big values become smaller and vice versa). Normalization (scaling) helps us reach coefficients that are more consistent with respect to magnitude, so that none of them affects the outcome in a dominant way!
2- It makes the operations inside it easier to perform (multiplications → summations, divisions → subtractions, exponents → multiplications).
3- It creates a curve/hyperplane (value sequence) which has ‘monotonicity’. Functions which increase or decrease monotonically:
- can be traversed by an ‘optimization solver’[6] more efficiently with respect to time, since they do not contain ‘local minima/maxima’, and
- can act as a representative of the original (unscaled) function, since the optimal solution for the logarithmic function is identical to the optimal solution for the original function, as the identity below shows.
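(A one-line justification of the last point, under the assumption that L denotes the function being optimized:)

argmax_β L(β) = argmax_β log L(β), because log(·) is strictly increasing.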
Curves/Surfaces of ‘logarithmic functions’ with different bases (natural or others) can be found in Figure-3 below. As can be seen, all of them are ‘monotonic’ and cut the x-axis at the same point (log(1)=0). In the Logistic Regression case, we invariably use the natural base (e) for our logarithmic function.
Passing through x=1 (where y=0) helps us make more meaningful interpretations of the ‘Event’ to ‘No-Event’ (log odds) ratio.
So, for cases where P(Event) > P(NoEvent) we stay on the positive side of the function; otherwise we pass to the negative side. This makes a lot of sense while labeling observations in the outcome space, as the short sketch below illustrates.
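To make this concrete, here is a minimal Python sketch (not from the original article; the scores are made-up illustrative values) showing that the sign of the log odds tells us which side of the decision boundary an observation falls on:

```python
import numpy as np

def sigmoid(z):
    # squashes any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical linear scores f(x) for three observations
scores = np.array([2.0, 0.0, -1.5])

p_event = sigmoid(scores)                    # P(Event | x)
log_odds = np.log(p_event / (1 - p_event))   # recovers the original scores

for s, p, lo in zip(scores, p_event, log_odds):
    label = 1 if lo > 0 else 0               # positive log odds -> P(Event) > P(NoEvent)
    print(f"f(x)={s:+.1f}  P(Event)={p:.3f}  log-odds={lo:+.2f}  label={label}")
```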
D. Objective Function
Like other Machine Learning Classifiers[7], Logistic Regression has an ‘objective function’, which tries to maximize the ‘likelihood function’ of the experiment[8]. This approach is known as ‘Maximum Likelihood Estimation (MLE)’ and can be written mathematically as follows.
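(A reconstruction of the objective consistent with the definitions listed below, writing pᵢ = P(yᵢ|xᵢ):)

β̂ = argmax_β ∏ᵢ pᵢ^(yᵢ) · (1 − pᵢ)^(1−yᵢ)

or, equivalently, maximizing the log likelihood:

β̂ = argmax_β Σᵢ [ yᵢ·log(pᵢ) + (1 − yᵢ)·log(1 − pᵢ) ]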
where:
- the output yᵢ ϵ {0,1},
- P(yᵢ|xᵢ) is the posterior probability, which is equal to 1/(1+e⁻ᶠ), and
- β is the vector of parameters (‘weights/coefficients’) in f(x)
as we defined earlier. Before describing and optimizing this objective with respect to the parameter β, it may be better to switch back to the ‘coin’ experiment in order to simplify the remaining derivations. So, the ‘objective function’ of the ‘flipping a coin’ problem can be written in the format below.
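(Reconstructed from the description that follows, writing p for the probability of ‘Head’:)

L(p) = ∏ᵢ p^(yᵢ) · (1 − p)^(1−yᵢ)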
where p is the probability of success (say, ‘Head’), yᵢ is an independent Bernoulli random variable (yᵢ ϵ {0,1}), and the inner term is the ‘joint likelihood distribution’ function of the experiment. We want to find the optimal value of p that maximizes this function. But how did we decide that maximizing it corresponds to our main goal, which is getting high classification accuracy? The same question for ‘Linear Regression’ is definitely clear, since choosing LSE as the objective function obviously yields the shortest distances between predictions (yᵢ_hat) and actual targets (yᵢ).
To obtain the same clarity for Logistic Regression’s MLE case, we need to approach it numerically. To do that, let’s assign miscellaneous values to the likelihood p in the objective function of the coin experiment. Those values may or may not exhibit discordance with the known target values yᵢ.
We designed 4 examples, 2 of which have no discordance between p and the target; that is, they represent a successful classification. When we compare the values the likelihood function returns, assignments of p that suit the target y produce higher likelihoods! This is why we choose to maximize the ‘log likelihood function’ as the objective in the Logistic Regression case above. More formally, we can summarize this logic in the two steps given below (a short numerical sketch follows the two steps).
- for samples labeled as ‘1’ we desire to estimate p as close to 1 as possible
- for samples labeled as ‘0’ we desire to estimate (1-p) as close to 1 as possible
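A minimal Python sketch of the numerical reasoning above (the four (p, y) pairs are made-up illustrative values, not the exact ones used in the original examples):

```python
import numpy as np

def bernoulli_likelihood(p, y):
    # likelihood contribution of a single observation: p^y * (1-p)^(1-y)
    return p**y * (1 - p)**(1 - y)

# four hypothetical (p, y) pairs: the first two agree with the target,
# the last two disagree
examples = [(0.9, 1), (0.1, 0), (0.1, 1), (0.9, 0)]

for p, y in examples:
    lik = bernoulli_likelihood(p, y)
    print(f"p={p:.1f}  y={y}  likelihood={lik:.2f}  log-likelihood={np.log(lik):+.2f}")

# concordant pairs yield a likelihood of 0.90, discordant pairs only 0.10,
# so maximizing the (log) likelihood pushes p toward the observed labels
```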
E. Optimizing Objectives
E.1. Getting the Gradient Equation (Differentiation)
E.1.1. Coin Experiment, ‘Average Learning’
So as not to make this post too heavy, let’s leave the remaining parts of the article to a second post, which I will share consecutively. All references and footnotes will be presented in the second, and possibly last, post. I hope you will follow along…