Deep dive into Logistic Regression and Regularization
Logistic regression falls under the category of supervised learning; it measures the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using a logistic/sigmoid function.
In spite of the name 'logistic regression', it is not used for regression problems, where the task is to predict a real-valued output. It is a classification algorithm, used to predict a binary outcome (1 or 0, -1 or 1, True or False) given a set of independent variables.
Logistic regression is quite similar to linear regression; in fact, we can see it as a generalized linear model.
In linear regression, we predict a real-valued output y based on a weighted sum of input variables.
y = c + x1*w1 + x2*w2 + x3*w3 + ... + xn*wn
The aim of linear regression is to estimate values for the model coefficients c, w1, w2, w3, ..., wn that fit the training data with minimal squared error, and then use them to predict the output y.
Logistic regression does the same thing, but with one addition. It runs the result through a special non-linear function called the logistic function or sigmoid function to produce the output y.
y = logistic(c + x1*w1 + x2*w2 + x3*w3 + ... + xn*wn)
y = 1 / (1 + e^-(c + x1*w1 + x2*w2 + x3*w3 + ... + xn*wn))
The sigmoid/logistic function is given by the following equation.
y = 1 / (1 + e^(-x))
As you see in the graph, it is an S-shaped curve that gets closer to 1 as the value of the input variable increases above 0 and gets closer to 0 as the input variable decreases below 0. The output of the sigmoid function is 0.5 when the input variable is 0.
Thus, if the output is more than 0.5, we can classify the outcome as 1 (positive), and if it is less than 0.5, we can classify it as 0 (negative). In other words, the logistic regression output is the probability that the example belongs to the positive class in a binary classification problem.
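To make this concrete, here is a minimal sketch in plain Python of how a trained logistic regression model turns a feature vector into a class label; the intercept and weight values are arbitrary illustrative numbers, not learned coefficients:

import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Hypothetical learned parameters for a 3-feature model
c = -0.5              # intercept
w = [0.8, -1.2, 0.3]  # weights w1..w3

def predict(x):
    z = c + sum(wi * xi for wi, xi in zip(w, x))  # weighted sum
    p = sigmoid(z)                                # probability of class 1
    return 1 if p > 0.5 else 0                    # threshold at 0.5

print(predict([1.0, 0.2, 3.0]))  # -> 1, since z = 0.96 gives p > 0.5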
Logistic regression can be understood through three interpretations: geometric, probabilistic, and loss-function based. All three lead to the same solution; they are just different ways of understanding the same model.
Equation of a plane
The equation of a plane through a point A = (x1, y1, z1) in 3D space whose normal vector is n = (a, b, c) is defined as
a(x - x1) + b(y - y1) + c(z - z1) = 0
ax + by + cz + d = 0, where d = -(a*x1 + b*y1 + c*z1)
For simplicity, we can also write it as
w1x1 +w2x2 + w3x3 + b= 0
This is the same as w^T * xi + b = 0, where xi is the ith observation; if the plane passes through the origin, the equation becomes w^T * xi = 0. Here w^T (read as w transpose) is a row vector, xi is a column vector, and b (the intercept/bias) is a scalar. In 2-dimensional space the equation becomes w1x1 + w2x2 + b = 0, and in n-dimensional space it becomes w0 + w1x1 + w2x2 + w3x3 + ... + wnxn = 0; we can extend it to any dimension because it is a linear equation.
Geometric interpretation of logistic regression
Let x (predictor/features/independent variable) and y (response/target/dependent variable) form the dataset D of n data points, i.e. D = {(xi, yi)}. Each xi ∈ ℝ^d is a real-valued d-dimensional feature vector, and each yi ∈ {-1 (negative), +1 (positive)}. The underlying assumption of logistic regression is that the data are almost linearly separable (some positive-class points lie on the negative side and vice versa) or perfectly linearly separable (no points are mixed with the other class), as in Figure 1. Our main objective is to find a line (in 2D) or plane/hyperplane (in 3D or more dimensions) that separates the two classes as well as possible, so that when the model encounters a new point it can easily classify which class the point belongs to. Since x and y are fixed (they come from the training data), if we can find w (the normal) and b (the bias/intercept), we can easily find the line or plane, also called the decision boundary. Here we will focus on just two features (x1 and x2) so that the intuition becomes easy, although real machine learning datasets rarely have only 2 or 3 dimensions.
Figure 1: Almost vs. perfectly linearly separable data.
In Figure 2, if we take any of the positive-class points and compute the distance from the point to the plane, di = w^T*xi / ||w|| (let the norm ||w|| be 1, so di = w^T*xi). Since xi is on the same side of the decision boundary that w points towards, the distance will be positive. Now compute dj = w^T*xj; since xj is on the opposite side from w, the distance will be negative. So we can say that points in the same direction as w are all positive points, and points in the opposite direction of w are negative points.
Figure 2: Signed distances of points on either side of the decision boundary.
Now we can easily classify the negative and positive points: if w^T*xi > 0 then y = +1, and if w^T*xi < 0 then y = -1. While doing this we may make some mistakes, but that is okay, because in the real world we will never get data that are perfectly separable.
Observations:
Look at Figure 2 and observe the points listed below:
- If yi = +1, the point is a positive data point, and if w^t*xi > 0, the classifier (a mathematical function, implemented by a classification algorithm, that maps input data to a category) is also saying it is positive. So if yi*w^t*xi > 0, the point is correctly classified, because multiplying two positive numbers always gives a number greater than 0.
- If yi = -1, the point is a negative data point, and if w^t*xi < 0, the classifier is also saying it is negative. Again yi*w^t*xi > 0, so the point is correctly classified, because multiplying two negative numbers always gives a number greater than zero. So for both positive and negative points, yi*w^t*xi > 0 implies the model is correctly classifying the point xi.
- If yi = +1 and w^t*xi < 0, the actual class label is positive but the classifier says it is negative, so yi*w^t*xi < 0. This is a misclassified point.
- If yi = -1 and w^t*xi > 0, the actual class label is negative but the classifier says it is positive, so again yi*w^t*xi < 0, and the point is misclassified.
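A minimal NumPy check of this sign rule; the data points, w, and b below are made-up values for illustration only:

import numpy as np

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.5, -0.5], [-2.0, 1.0]])  # 4 points, 2 features
y = np.array([1, 1, -1, -1])  # class labels
w = np.array([1.0, 0.5])      # hypothetical normal vector
b = 0.0                       # hypothetical bias

signed = y * (X @ w + b)      # yi * (w^T xi + b) for every point
print(signed)                 # positive entries => correctly classified
print("correct:", np.sum(signed > 0), "of", len(y))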
From the above observations, we want our classifier to minimize the misclassification error, i.e. we want yi*w^T*xi to be greater than 0 for as many points as possible. Here xi and yi are fixed because they come from the dataset. As we change w and b, the sum below changes, and we want to find the w and b that maximize it:

(w*, b*) = argmax over (w, b) of Σi yi*(w^T*xi + b)
Need for Logistic Function or “ S” shape curve or Sigmoid Function
The sigmoid function is a differentiable real function that is defined for all real inputs and has a non-negative derivative at each point. It is a monotonic function that squashes values between 0 and 1. We will look at a very simple example of how the sum of signed distances (Σi yi*w^T*xi) can be distorted by erroneous/outlier points, and why we need another formulation that is less impacted by outliers.
Suppose in the left panel of Figure 3, the distance (d) from every point to the decision boundary is 1, on both the negative and the positive side, except for one outlier: a negative point sitting on the positive side of the decision boundary at distance 100. If we compute the sum of signed distances (10 correctly classified points contributing +1 each, the outlier contributing -100), it will be -90. In the right panel of Figure 3, all distances are again 1, but 5 negative points sit on the positive side of the decision boundary; the sum of signed distances is 1. So the right boundary has 5 misclassified points with a sum of signed distances of 1, while the left boundary has only 1 misclassified point with a sum of -90. Remember that we wanted to maximize the sum of signed distances, so this criterion picks the right boundary (sum = 1) over the left one, even though the left one makes only a single mistake. So, if we choose the sum of signed distances, then in the presence of an outlier our prediction may not be correct and we can end up with the worse model.
Figure 3: Two decision boundaries; the left one misclassifies a single extreme outlier, the right one misclassifies 5 nearby points.
So, to avoid this problem, we need a formulation that is more robust than maximizing the raw signed distances. The function we use here is the sigmoid function defined above, and the objective becomes maximizing the squashed signed distances:

w* = argmax over w of Σi sigmoid(yi*w^T*xi), where sigmoid(z) = 1 / (1 + e^(-z))

Maximizing a function f(x) is the same as minimizing that function with a negative sign, i.e. argmax f(x) = argmin -f(x), and if we take the log of each term (we will discuss why we use the log in the loss-minimization interpretation), the final formulation becomes:

w* = argmin over w of Σi log(1 + e^(-yi*w^T*xi))
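A small sketch showing why the squashing helps, using the made-up distance values from the Figure 3 discussion: with raw signed distances the outlier makes the better (left) boundary look terrible, while after the sigmoid the outlier contributes almost nothing:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Signed distances yi * w^T xi for the two boundaries in Figure 3 (made-up values)
left  = np.array([1.0] * 10 + [-100.0])   # 10 correct points + 1 extreme outlier
right = np.array([1.0] * 6 + [-1.0] * 5)  # 6 correct points + 5 mildly misclassified

for name, z in [("left", left), ("right", right)]:
    print(name, "raw sum:", z.sum(), "| squashed sum:", round(sigmoid(z).sum(), 2))

The raw sum prefers the right boundary (1 > -90) even though it makes 5 mistakes, while the squashed sum (7.31 > 5.73) correctly prefers the left boundary, whose only mistake is the outlier.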
Probabilistic interpretation of logistic regression
Before we dig deep into logistic regression, we need to clear up some fundamentals of probability. For simplicity, we will consider a dataset that tells us, depending on gender, whether a customer will purchase a product or not. We import and check the dataset:
import pandas as pd

gender_df = pd.read_csv('gender_purchase.csv')
print(gender_df.head(3))
>>> Gender Purchase
0 Female Yes
1 Female Yes
2 Female No
We will create a table of frequency of ‘yes’ and ‘no’ depending on the gender, using the crosstab feature of pandas. The table will be of great use to understand odds and odds ratios later on.
table = pd.crosstab(gender_df['Gender'], gender_df['Purchase'])
print(table)
>>> Purchase   No  Yes
    Gender
    Female    106  159
    Male      125  121
We're now ready to define odds, which is the ratio of the probability of success to the probability of failure. Considering the female group, the probability that a female will purchase the product (success) is 159/265 (yes / total number of females). The probability of failure (no purchase) for a female is 106/265. In this case, the odds are defined as (159/265)/(106/265) = 159/106 ≈ 1.5. The higher the odds, the better the chance of success. The odds can be any number in the range [0, ∞). What happens to this range if we take the natural logarithm? log(x) is defined for x > 0, but its range is (-∞, ∞). You can check this with a snippet of code:
import math
from random import uniform
import matplotlib.pyplot as plt

random = []
xlist = []
for i in range(100):
    x = uniform(0, 10)  # choose numbers between 0 and 10
    xlist.append(x)
    random.append(math.log(x))
plt.scatter(xlist, random, c='purple', alpha=0.3, label=r'$log x$')
plt.ylabel(r'$log \, x$', fontsize=17)
plt.xlabel(r'$x$', fontsize=17)
plt.legend(fontsize=16)
plt.show()
So far we have understood odds. Let's describe the odds ratio, which, as the name suggests, is the ratio of odds. Considering the example above, the odds ratio tells us which group (male/female) has better odds of success, and it is given by the ratio of the odds of the two groups. So the odds ratio for females = odds of a successful purchase by a female / odds of a successful purchase by a male = (159/106)/(121/125). The odds ratio for males is the reciprocal of this number.
We can clearly see that while the odds ratio varies between 0 and positive infinity, log(odds ratio) varies over (-∞, ∞). Specifically, when the odds ratio lies in (0, 1), log(odds ratio) is negative.
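These quantities are easy to compute directly from the crosstab above (265 females, 246 males):

import math

female_odds = (159 / 265) / (106 / 265)  # = 159/106 ≈ 1.50
male_odds   = (121 / 246) / (125 / 246)  # = 121/125 ≈ 0.97
odds_ratio  = female_odds / male_odds    # ≈ 1.55, females have better odds

print(female_odds, male_odds, odds_ratio)
print(math.log(odds_ratio))              # log(odds ratio) ≈ 0.44; negative when the ratio is below 1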
Since, confusingly, the term 'regression' is present in 'logistic regression', let us spare a few seconds to review regression. Regression usually refers to continuity, i.e. predicting continuous variables (medicine price, taxi fare, etc.) from features. Logistic regression, however, is about predicting binary variables, i.e. cases where the target variable is categorical.
In linear regression where feature variables can take any values, the output (label) can thus be continuous from negative to positive infinity.
Since logistic regression is about classification, Y is a categorical variable. It is clearly not possible to achieve such an output with a linear regression model (Eq. 1.1), since the ranges on the two sides do not match. Our aim is to transform the LHS so that it matches the range of the RHS, which is governed by the range of the feature variables, (-∞, ∞).
We will follow some intuitive steps to see how such an outcome can be achieved.
For linear regression, both X and Y range from minus infinity to plus infinity. Y in logistic regression is categorical; for the problem above, it takes one of the two distinct values 0 and 1. First, we try to predict the probability P using the regression model. Instead of two distinct values, the LHS can now take any value from 0 to 1, but the ranges still differ from the RHS.
We discussed above that odds vary over [0, ∞). This is better than probability (which is limited between 0 and 1) and one step closer to matching the range of the RHS; writing the model in terms of odds gives P/(1-P) = a + b*X (eq. 1.3).
Many of you will have already seen that if we now take the natural logarithm of the LHS of eq. 1.3, the ranges on both sides match: log(P/(1-P)) = a + b*X (eq. 1.4).
With this, we have achieved a regression model, where the output is the natural logarithm of the odds, also known as logit. The base of the logarithm is not important but taking the logarithm of odds is.
We can retrieve the probability of success from eq. 1.4 by solving for P: P = 1 / (1 + e^-(a + b*X)) (eq. 1.5).
Logistic Function
The RHS of equation 1.5, which is also known as the logistic function, is exactly the sigmoid function applied to a + b*X. We can check the behavior of this function with a snippet of Python code.
import math
from random import uniform
import matplotlib.pyplot as plt

random1 = []
random2 = []
random3 = []
xlist = []
theta = [10, 1, 0.1]
for i in range(100):
    x = uniform(-5, 5)
    xlist.append(x)
    logreg1 = 1 / (1 + math.exp(-(theta[0] * x)))
    logreg2 = 1 / (1 + math.exp(-(theta[1] * x)))
    logreg3 = 1 / (1 + math.exp(-(theta[2] * x)))
    random1.append(logreg1)
    random2.append(logreg2)
    random3.append(logreg3)
plt.scatter(xlist, random1, marker='*', s=40, c='orange', alpha=0.5, label=r'$\theta = %3.1f$' % (theta[0]))
plt.scatter(xlist, random2, c='magenta', alpha=0.3, label=r'$\theta = %3.1f$' % (theta[1]))
plt.scatter(xlist, random3, c='navy', marker='d', alpha=0.3, label=r'$\theta = %3.1f$' % (theta[2]))
plt.axhline(y=0.5, label='P=0.5')
plt.ylabel(r'$P=\frac{1}{1+e^{-\theta \, x}}$', fontsize=19)
plt.xlabel(r'$x$', fontsize=18)
plt.legend(fontsize=16)
plt.show()
Figure 2: Probability vs independent variable x; resembles sigmoid function plot.
From the plot above, notice that the higher the value of the coefficient (orange stars) of the independent variable (here X), the better the curve can represent the two distinct probabilities 0 and 1. For a lower value of the coefficient, it is essentially a straight line, resembling a simple linear regression function. Comparing with equation 1.5, in Figure 2 the fixed term a is taken as 0. The effect of the fixed term on the logistic function can also be understood using the plot below.
Figure 3: Sigmoid function for different values of intercept (a).
Just like in linear regression, where the constant term denotes the intercept on the Y-axis (hence a shift along the Y-axis), for the logistic function the constant term shifts the S-curve along the X-axis. The figures above (Figs. 2, 3) should convince you that it is indeed possible to optimize a model using logistic regression that can classify data, i.e. predict 0 or 1.
Log loss Function:
Purple is hinge loss, yellow is log loss, blue is 0-1 loss, and green is mean squared error.
Looking at the plots above, the nature of the curves brings out a few major differences between logistic loss and hinge loss:
- Note that the logistic loss diverges faster than hinge loss. So, in general, it will be more sensitive to outliers — why? Because, assuming there is an outlier in our data, the logistic loss (due to its diverging nature) will be very high compared to the hinge loss for that outlier. This means greater adjustments to our weights.
- Note that the logistic loss does not go to zero even if a point is classified sufficiently confidently. The horizontal axis is the confidence of the predicted y value: if we take a value like 1.5 on the x-axis, the corresponding logistic loss (yellow line) still shows some loss (close to 0.2 in the plot above, so we are still not fully confident of our prediction), whereas the hinge loss is 0 (no loss, full confidence). This nature of logistic loss might lead to a minor degradation in accuracy. (See the numeric check after this list.)
- Logistic regression has a more probabilistic interpretation.
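A quick numeric check of the claims above, using the standard margin forms log(1 + e^(-z)) for logistic loss and max(0, 1 - z) for hinge loss, where z = y * w^T x:

import math

def log_loss(z):     # logistic loss as a function of the margin z
    return math.log(1 + math.exp(-z))

def hinge_loss(z):
    return max(0.0, 1 - z)

for z in [-1.0, 0.0, 1.0, 1.5, 3.0]:
    print(f"z={z:+.1f}  log={log_loss(z):.3f}  hinge={hinge_loss(z):.3f}")
# at z=1.5 the log loss is still ~0.20 while the hinge loss is exactly 0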
Over-fitting and Regularization:
In supervised machine learning, models are trained on a subset of the data, aka the training data. The goal is to learn a model that can predict the target of each example from its features.
Overfitting happens when the model learns the noise in the training data along with the signal, and therefore does not perform well on new data that the model was not trained on.
There are a few ways to avoid overfitting your model on the training data, such as cross-validation sampling, reducing the number of features, pruning, regularization, etc.
Regularization adds a penalty that grows with model complexity. The regularization parameter (lambda) penalizes all the parameters except the intercept, so that the model generalizes the data and does not overfit.
In order to create a less complex (parsimonious) model when you have a large number of features in your dataset, two of the regularization techniques used to address overfitting and feature selection are:
1. L1 Regularization
2. L2 Regularization
A regression model that uses the L1 regularization technique is called Lasso Regression and the model which uses L2 is called Ridge Regression.
The key difference between these two is the penalty term.
Ridge regression adds the "squared magnitude" of the coefficients as a penalty term to the loss function: loss = Σi (yi - ŷi)² + λ * Σj wj². The λ * Σj wj² term is the L2 regularization element.
Here, if lambda is zero, we get back OLS. However, if lambda is very large, the penalty will shrink the coefficients too much and lead to underfitting. That said, it is important how lambda is chosen; with a well-chosen lambda, this technique works very well to avoid overfitting.
Lasso regression (Least Absolute Shrinkage and Selection Operator) adds the "absolute value of magnitude" of the coefficients as a penalty term to the loss function: loss = Σi (yi - ŷi)² + λ * Σj |wj|.
Again, if lambda is zero we get back OLS, whereas a very large value will drive coefficients to zero and the model will underfit.
The key difference between these techniques is that lasso shrinks the less important features' coefficients to zero, thus removing some features altogether. So this works well for feature selection when we have a huge number of features.
Traditional methods for handling overfitting and performing feature selection, like cross-validation and stepwise regression, work well with a small set of features, but regularization is a great alternative when we are dealing with a large set of features.
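A minimal scikit-learn sketch of this difference on synthetic data; note that scikit-learn parameterizes regularization with C = 1/lambda, so a smaller C means a stronger penalty:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 20 features, only 5 of them informative
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)

l1 = LogisticRegression(penalty='l1', solver='liblinear', C=0.1).fit(X, y)
l2 = LogisticRegression(penalty='l2', C=0.1).fit(X, y)

print("L1 zeroed coefficients:", int(np.sum(l1.coef_ == 0)))  # several exact zeros
print("L2 zeroed coefficients:", int(np.sum(l2.coef_ == 0)))  # typically none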
Hyperparameter Tuning:
Lambda is the hyperparameter of logistic regression; it is the strength of the regularizer (L1 or L2). If lambda is too low, we will encounter the overfitting problem, and if lambda is too high, we will encounter the underfitting problem.
We have to tune the hyperparameter in such a way that it trades off bias and variance.
Find the best hyperparameter using k-fold cross-validation or simple cross-validation data.
Use GridSearchCV or RandomizedSearchCV, or you can also write your own for loops to do this task of hyperparameter tuning.
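A short GridSearchCV sketch over the usual grid of powers of ten, reusing the synthetic X, y from the previous sketch; remember that scikit-learn's C is 1/lambda:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Candidate values; C = 1/lambda, so this spans strong to weak regularization
param_grid = {'C': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000]}

grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                    cv=5, scoring='neg_log_loss')
grid.fit(X, y)  # X, y from the previous sketch
print(grid.best_params_, round(grid.best_score_, 3))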
The hyperparameter K in the KNN algorithm is an integer, but lambda in logistic regression is a real value.
A good set of lambda values to try is [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000].
Hyperparameter tuning plot: X-axis is the lambda value, Y-axis is the CV error value.
Elastic-net hyperparameter tuning: X-axis is the lambda1 value, Y-axis is the lambda2 value, and Z-axis is the CV error value.
Different cases for Logistic Regression:
1. The decision surface is a hyperplane (in higher dimensions), and LR assumes that the data is linearly separable.
2. Feature importance: we can get feature importance via L1 regularization; see the "Feature Importance" section below.
3. Impact of outliers: the sigmoid function dampens the effect of outliers; see the "Impact of Outliers" section below for a removal procedure.
4. Multi-class classification: Use One vs All or One vs One technique.
5. Impact of dimensions: logistic regression scales well to high-dimensional data (training and testing are linear in d); when d is very large, L1 regularization helps keep the model sparse.
Feature Importance:
We can get feature importance from the learned weights when using L1 regularization, but we have to ensure that there is no multicollinearity: if features are correlated, the weight can be split among them arbitrarily, and the magnitudes are no longer reliable importance scores.
Impact of imbalanced Data:
Logistic regression is impacted by imbalanced data.
We need to convert it to a balanced dataset using upsampling/downsampling techniques.
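Besides resampling, scikit-learn's class_weight option is a common built-in alternative; a minimal sketch, where X and y stand for your imbalanced training data:

from sklearn.linear_model import LogisticRegression

# 'balanced' reweights each class inversely proportional to its frequency,
# an alternative to explicit upsampling/downsampling
clf = LogisticRegression(class_weight='balanced', max_iter=1000)
clf.fit(X, y)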
Impact of Outliers:
The sigmoid function dampens the effect of outliers, but extreme points can still be removed with the following mechanism:
- Compute w by training the model.
- For every point xi, calculate w^T*xi, i.e. the distance of xi from the plane.
- Remove all the xi's that are very far away from the plane, then train the model again to get the final classifier w.
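A sketch of that mechanism with scikit-learn, reusing X, y from the earlier sketches; the 5% cutoff is an arbitrary illustrative choice, and decision_function returns w^T*xi + b, which is proportional to the distance from the plane:

import numpy as np
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=1000).fit(X, y)  # step 1: learn w
dist = np.abs(clf.decision_function(X))            # step 2: |w^T xi + b| for every point
keep = dist <= np.percentile(dist, 95)             # step 3: drop the farthest 5%
clf_final = LogisticRegression(max_iter=1000).fit(X[keep], y[keep])  # retrain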
Impact of Missing Values:
Logistic regression is impacted by missing values; we need to handle them using pandas techniques or an imputation model.
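For example, a quick sketch of both approaches; the tiny DataFrame and its columns are hypothetical:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'age': [25.0, np.nan, 40.0], 'income': [50.0, 60.0, np.nan]})

df_filled = df.fillna(df.median())      # pandas: fill with per-column medians

imp = SimpleImputer(strategy='median')  # sklearn: learns medians reusable on test data
X_imputed = imp.fit_transform(df)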
Multinomial Classification:
Logistic regression is one of the most popular supervised algorithms. This classification algorithm is mostly used for solving binary classification problems. People follow the myth that logistic regression is only useful for binary classification problems, which is not true. The logistic regression algorithm can also be used to solve multi-class classification problems, using either of two strategies:
- One Versus Rest (OVR)
- One Versus One
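A minimal one-versus-rest sketch on scikit-learn's built-in 3-class iris data:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X_iris, y_iris = load_iris(return_X_y=True)  # 3 classes of flowers

# One binary logistic regression per class; the class with the highest score wins
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_iris, y_iris)
print(ovr.predict(X_iris[:5]), ovr.score(X_iris, y_iris))

sklearn.multiclass.OneVsOneClassifier works the same way for the one-versus-one strategy.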
Feature Transformation in Logistic Regression:
The main assumption of logistic regression is that the data is linearly separable. What do we do if we encounter a non-linear problem? First, we apply a feature transformation that converts the non-linear problem into a linear one, and then we apply logistic regression.
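A classic illustration with synthetic data: two concentric classes that no straight line can separate, fixed by adding a squared-radius feature:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
r = np.concatenate([rng.uniform(0, 1, 200), rng.uniform(2, 3, 200)])  # inner disk vs outer ring
t = rng.uniform(0, 2 * np.pi, 400)
X_ring = np.c_[r * np.cos(t), r * np.sin(t)]
y_ring = np.array([0] * 200 + [1] * 200)

print(LogisticRegression().fit(X_ring, y_ring).score(X_ring, y_ring))  # ~0.5: linear fails
X_new = np.c_[X_ring, X_ring[:, 0] ** 2 + X_ring[:, 1] ** 2]           # add x1^2 + x2^2
print(LogisticRegression().fit(X_new, y_ring).score(X_new, y_ring))    # ~1.0: now separable

sklearn.preprocessing.PolynomialFeatures automates this kind of transformation.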
Time and Space complexity of Logistic Regression:
Time Complexity:
Training time: we need to solve W* = argmin {loss term + regularization}; each gradient step touches all n points and d dimensions.
Therefore, T(n) = O(n*d).
Testing phase: T(n) = O(d), since prediction only computes w^T*x.
Space Complexity:
Training: O(n*d), since we must hold the n × d training data in memory.
Testing: O(d), since we only need to store the weight vector w.
Interview questions on Logistic Regression:
1. Is logistic regression a supervised learning algorithm? Ans: Yes
2. Is logistic regression mainly used for regression? Ans: No, it is used for classification
3. Is it possible to design a logistic regression algorithm using a Neural Network Algorithm? Ans: Yes
4. Is it possible to apply a logistic regression algorithm on a 3-class Classification problem? Ans: Yes
5. Which of the following methods do we use to best fit the data in Logistic Regression? Ans: B
A) Least Square Error
B) Maximum Likelihood
C) Jaccard distance
D) Both A and B
Maximum likelihood estimation is a method that finds the parameter values (e.g. μ and σ) that result in the curve that best fits the data. The goal of maximum likelihood is to find the parameter values that give the distribution that maximizes the probability of observing the data.
Probability: the distribution is fixed, and we find the probability of getting the observed value under that distribution.
Likelihood: the observed data is fixed and the distribution varies; across the varying distributions, we find the best-fit distribution for the observed data.
6. Which of the following evaluation metrics can not be applied in the case of logistic regression output to compare with the target? Ans: D
A) AUC-ROC
B) Accuracy
C) Log loss
D) Mean-Squared-Error
Since logistic regression is a classification algorithm, its output is not a real-valued quantity, so mean squared error cannot be used for evaluating it.
7. Standardization of features is required before training a Logistic Regression. Ans: B
A) TRUE
B) FALSE
Standardization isn’t required for logistic regression. The main goal of standardizing features is to help the convergence of the technique used for optimization.
8. Below are the three scatter plots (A, B, C left to right) and hand-drawn decision boundaries for logistic regression.
i. Which of the above figures shows a decision boundary that is overfitting the training data? Ans: C
A) A
B) B
C) C
D)None of these
Since the decision boundary in the third figure (C) is not smooth, it is overfitting the training data.
ii. What do you conclude after seeing this visualization? Ans: C
1. The training error in the first plot is maximum as compared to the second and third plots.
2. The best model for this regression problem is the last (third) plot because it has minimum training error (zero).
3. The second model is more robust than the first and third because it will perform best on unseen data.
4. The third model is overfitting more as compared to the first and second.
5. All will perform the same because we have not seen the testing data.
A) 1 and 3
B) 1 and 3
C) 1, 3 and 4
D) 5
The trend in the graphs looks like a quadratic trend over the independent variable X. A higher-degree polynomial (right graph) might have very high accuracy on the training population but is expected to fail badly on the test dataset. In the left graph, the training error is maximum because the model underfits the training data.
iii. Suppose the above decision boundaries were generated for different values of regularization. Which of the above decision boundaries shows the maximum regularization? Ans: A
A) A
B) B
C) C
D) All have equal regularization
More regularization means more penalty, which means a less complex decision boundary; that is what the first figure (A) shows.
9. The below figure shows AUC-ROC curves for three logistic regression models. Different colors show curves for different hyperparameters values. Which of the following AUC-ROC will give the best result? Ans: A
A) Yellow
B) Pink
C) Black
D) All are same
The best classifier has the largest area under the curve, and the yellow line has the largest area under the curve.
10. What are the outputs of the logistic model and the logistic function?
The logistic model outputs the logits, i.e. log-odds; and the logistic function outputs the probabilities.
Logistic model: z = α + β1*X1 + β2*X2 + ... + βk*Xk. The output of this is the logits.
Logistic function: f(z) = 1 / (1 + e^-(α + β1*X1 + β2*X2 + ... + βk*Xk)). The output in this case is the probabilities.
11. What is Deviance?
It is a measure of goodness of fit of a generalized linear model.
The higher the deviance value, the poorer the model fit.
12. What is AIC?
Its full form is the Akaike Information Criterion (AIC).
It is useful when we have more than one model and want to compare their goodness of fit.
It is based on the maximum likelihood of the model, with a penalty on the number of parameters to prevent overfitting; it measures the trade-off between fit and model flexibility.
It's analogous to adjusted R2 in multiple linear regression where it tries to prevent you from including irrelevant predictor variables.
A model with lower AIC is better than a model with higher AIC.
13. In logistic regression, the probability value varies from? 0 to 1
14. In logistic regression, the odds ratio varies from? 0 to +infinity
15. In logistic regression, the logit value varies from? -infinity to +infinity