Logistic Regression
Is the object in the picture a cat or not?
Is the mail I received spam or not?
Will I qualify for an exam or not (based on my score)?
In all the above examples we get a sense that we are doing some kind of classification task, in which we predict, given the input features, the probability that an event is 1 (happening) or 0 (not happening).
Logistic Regression is a supervised machine learning approach that helps us build a model (hypothesis) that can make predictions on classification tasks.
Let's take a simple dataset with 2 input features and 1 output. It is an example of binary classification, as our output is always 0 or 1.
I am denoting output 1 as the positive class (with a cross sign) and output 0 as the negative class (with a dot sign). Let's visualize this data.
Visually we can clearly see a boundary that separates these data points; we want our hypothesis to learn the optimum parameters to create that boundary.
z = w1*x1 + w2*x2 + b
We need the output to be between 0 and 1, but this equation can give us any real number in the range (-∞, ∞).
So here we use the sigmoid function, which squashes the output into the range (0, 1).
sigmoid(z) = 1/(1 + exp(-z))
The y_predicted that we are going to use comes from the above 2 equations: first we calculate z, then we get y_predicted by putting z into the sigmoid function.
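The two steps above can be sketched in Python with NumPy; the weight and feature values here are made-up numbers just for illustration:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters and one example with 2 input features
w1, w2, b = 0.5, -0.3, 0.1
x1, x2 = 2.0, 1.0

z = w1 * x1 + w2 * x2 + b   # step 1: linear combination
y_predicted = sigmoid(z)     # step 2: probability that the output is 1
print(y_predicted)
```

Whatever values z takes, y_predicted always lands strictly between 0 and 1, which is what lets us read it as a probability.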
Loss for Logistic Regression
In linear regression we use the loss (y_predicted - y)^2,
and the cost is calculated as (1/2m) * ∑ (y_pred(i) - y(i))^2 for i = 1 to m, where m is the number of training examples.
If we use the same loss for logistic regression, it would be problematic.
The left curve, cost vs. parameters for linear regression, is convex and has only one minimum, so gradient descent can easily find the best parameters for minimizing the cost.
The right curve shows logistic regression with the same cost function: it contains many local minima, which can prevent the parameters from reaching the optimum value that minimizes the cost function.
This occurs because in linear regression we use y_pred = w1*x1 + w2*x2 + b, but in logistic regression we use y_pred = sigmoid(w1*x1 + w2*x2 + b)
We need to find a loss function that produces a convex cost vs. parameters curve for logistic regression.
Keep in mind: for logistic regression y_pred = sigmoid(w1*x1 + w2*x2 + b)
Loss (for i(th) example) = -y(i) * log(y_pred(i)) - (1 - y(i)) * log(1 - y_pred(i))
Cost function would be = ( 1/m ) ∑ (-y(i)*log(y_pred(i)) - (1 - y(i))*log(1 - y_pred(i))) for i from 1 to m.
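This cost function can be written as a short sketch; the labels and predicted probabilities below are made-up values for illustration:

```python
import numpy as np

def cost(y, y_pred):
    # Average cross-entropy (log) loss over m training examples
    m = len(y)
    return -(1.0 / m) * np.sum(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))

# Hypothetical labels and predicted probabilities for 3 examples
y = np.array([1, 0, 1])
y_pred = np.array([0.9, 0.2, 0.7])
print(cost(y, y_pred))
```

Note how each term works: when y(i) = 1 only -log(y_pred(i)) survives, and when y(i) = 0 only -log(1 - y_pred(i)) survives, so a confident wrong prediction is punished heavily while a confident correct one costs almost nothing.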
Now we need to look at the parameter update rule (Gradient Descent).
w1 = w1 - learning_rate * dJ/dw1, where dJ/dw1 = (1/m) ∑ (y_pred(i) - y(i)) * x1(i) for i = 1 to m
w2 = w2 - learning_rate * dJ/dw2, where dJ/dw2 = (1/m) ∑ (y_pred(i) - y(i)) * x2(i) for i = 1 to m
b = b - learning_rate * dJ/db, where dJ/db = (1/m) ∑ (y_pred(i) - y(i)) for i = 1 to m
Repeat these updates until the cost function is minimized.
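Putting all the pieces together, a minimal training loop might look like the sketch below. The toy dataset, learning rate, and iteration count are assumptions chosen for illustration, not part of the original example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy, made-up data: 4 examples, 2 input features, binary labels
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([0, 0, 1, 1])

w = np.zeros(2)        # [w1, w2], initialized to zero
b = 0.0
learning_rate = 0.1
m = len(y)

for _ in range(1000):
    y_pred = sigmoid(X @ w + b)          # forward pass: z, then sigmoid
    error = y_pred - y
    dw = (1.0 / m) * (X.T @ error)       # dJ/dw1 and dJ/dw2
    db = (1.0 / m) * np.sum(error)       # dJ/db
    w -= learning_rate * dw              # gradient descent updates
    b -= learning_rate * db

print(sigmoid(X @ w + b))
```

After enough iterations the predicted probabilities fall on the correct side of 0.5 for each point, which is exactly the decision boundary shifting into place.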
Github code link : Github code logistic regression
Here you can see, by visualizing the decision boundary, how the parameters change during gradient descent, shifting the decision boundary until it classifies the points correctly.