Is the Gradient Descent Optimizer an Alchemy of Loss?
Let’s begin with an example. Suppose a person is at the top of a mountain, blindfolded, and has to reach the bottom. How does he get down? He feels the terrain with his feet and moves at a comfortable pace, neither too fast nor too slow, until he reaches the bottom. Likewise, gradient descent is an iterative optimization algorithm for finding a local minimum of a function.
To find a local minimum of a function using gradient descent, we move each parameter in the direction opposite to the derivative at the current point: if the derivative is positive, the function is increasing there, so we move the parameter to the left; if the derivative is negative, the function is decreasing, so we move the parameter to the right to decrease the value of the function even more. Subtracting a negative value from a parameter moves it to the right.
Why Gradient Descent?
Let’s take the example of linear regression, which looks like f(x) = wx + b. Here we don’t know the optimal values of w and b, so we have to learn them from data. To do that we search for the values of w and b that minimize the loss, which is the mean squared error (MSE), given as:

L(w, b) = (1/N) · Σᵢ (yᵢ − (w·xᵢ + b))²
To find those values of w and b, we use gradient descent.
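As a quick sketch of the loss above in plain Python (the function name and the toy data in the usage line are illustrative choices, not part of the original article):

```python
# Mean squared error for the linear model f(x) = w*x + b.
def avg_loss(X, Y, w, b):
    n = len(X)
    total = 0.0
    for i in range(n):
        total += (Y[i] - (w * X[i] + b)) ** 2  # squared residual
    return total / n

# With w = 2, b = 0 the model fits Y = 2*X exactly, so the loss is 0.
print(avg_loss([1.0, 2.0, 3.0], [2.0, 4.0, 6.0], 2.0, 0.0))
```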
Steps In Calculating Gradient Descent:
1. First we calculate the partial derivatives of the loss function with respect to w and b:

∂L/∂w = (1/N) · Σᵢ −2·xᵢ·(yᵢ − (w·xᵢ + b))
∂L/∂b = (1/N) · Σᵢ −2·(yᵢ − (w·xᵢ + b))
2. Update the values of w and b (initially both are set to 0):

w ← w − α · ∂L/∂w
b ← b − α · ∂L/∂b
We subtract (as opposed to adding) partial derivatives from the values of parameters because derivatives are indicators of growth of a function. If a derivative is positive at some point, then the function grows at this point. Because we want to minimize the objective function, when the derivative is positive we know that we need to move our parameter in the opposite direction (to the left on the axis of coordinates). When the derivative is negative (function is decreasing), we need to move our parameter to the right to decrease the value of the function even more. Subtracting a negative value from a parameter moves it to the right.
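To make this sign argument concrete, here is a tiny illustrative sketch on f(x) = x², whose derivative is 2x (the function, step size, and starting points are made up for illustration):

```python
# One gradient step on f(x) = x**2, whose derivative is f'(x) = 2*x.
# Subtracting the derivative moves x toward the minimum at 0 from either side.
def step(x, alpha=0.1):
    return x - alpha * 2 * x  # x - alpha * f'(x)

print(step(3.0))   # derivative is positive (+6), so the step moves left:  2.4
print(step(-3.0))  # derivative is negative (-6), subtracting it moves right: -2.4
```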
Now let’s implement all these steps in Python:
First, recall the mean squared error loss of the linear model, which the following code minimizes:

L(w, b) = (1/N) · Σᵢ (yᵢ − (w·xᵢ + b))²
And the partial derivatives computed inside the code are:

∂L/∂w = (1/N) · Σᵢ −2·xᵢ·(yᵢ − (w·xᵢ + b))
∂L/∂b = (1/N) · Σᵢ −2·(yᵢ − (w·xᵢ + b))
# Implementing Gradient Descent from Scratch
def update_w_and_b(X, Y, w, b, alpha):
    # X holds the feature values and Y the targets
    dl_dw = 0.0
    dl_db = 0.0
    n = len(X)
    for i in range(n):
        dl_dw += -2 * X[i] * (Y[i] - (w * X[i] + b))
        dl_db += -2 * (Y[i] - (w * X[i] + b))
    # update w and b
    w = w - (1 / float(n)) * alpha * dl_dw
    b = b - (1 / float(n)) * alpha * dl_db
    return w, b
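The update function above computes a single step, so a training loop has to call it repeatedly. A minimal sketch could look like the following; the train function, learning rate, epoch count, and toy data are illustrative assumptions, and update_w_and_b is repeated here so the snippet runs on its own:

```python
# Repeats update_w_and_b from the snippet above so this sketch is self-contained.
def update_w_and_b(X, Y, w, b, alpha):
    dl_dw, dl_db = 0.0, 0.0
    n = len(X)
    for i in range(n):
        dl_dw += -2 * X[i] * (Y[i] - (w * X[i] + b))
        dl_db += -2 * (Y[i] - (w * X[i] + b))
    w = w - (1 / float(n)) * alpha * dl_dw
    b = b - (1 / float(n)) * alpha * dl_db
    return w, b

# Illustrative training loop: start from w = b = 0 and repeat the update.
def train(X, Y, alpha=0.01, epochs=5000):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        w, b = update_w_and_b(X, Y, w, b, alpha)
    return w, b

# Toy data generated from y = 2x + 1; gradient descent should recover w and b.
X = [1.0, 2.0, 3.0, 4.0]
Y = [3.0, 5.0, 7.0, 9.0]
w, b = train(X, Y)
print(w, b)  # close to 2 and 1
```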
What Is the Role of Alpha Here?
We know that alpha is the learning rate; it sets the size of the step gradient descent takes on its way to the global minimum (the smallest among all the local minima).
1. If the learning rate is too small, the algorithm converges to the global minimum but takes too much time.
2. If the learning rate is too high, the updates overshoot and the algorithm does not converge.
3. If the learning rate is just right, the algorithm converges in a reasonable time.
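These three regimes can be seen on a toy one-dimensional example, f(x) = x² (the learning rates, step count, and starting point below are illustrative choices):

```python
# Run gradient descent on f(x) = x**2, whose derivative is 2*x.
def descend(alpha, steps=50, x0=5.0):
    x = x0
    for _ in range(steps):
        x = x - alpha * 2 * x
    return x

small = descend(0.01)  # too small: still far from 0 after 50 steps
good = descend(0.3)    # just right: essentially reaches the minimum
big = descend(1.1)     # too high: overshoots and diverges

print(abs(small), abs(good), abs(big))
```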
THAT’S IT FOR TODAY
HAPPY LEARNING
In the code snippet, X is the vector of feature values, not raw data points.