Newton's Method for Optimization


All machine learning algorithms rely on some optimization technique to find the minima of a cost function and thus provide the best-fit solution for a given problem. Of these techniques, Gradient Descent is the one most commonly taught in tutorials. This article covers another optimization technique, Newton's method, which is fairly easy and intuitive to understand with the help of basic mathematical tools. Once understood, its simplicity makes it a good first go-to method for optimizing a function in your own business domain.

Motivation Problem

Before we jump into any ML jargon, let's pause and think about a very basic question from our school days. (This will help us understand the origin of the method first; in the last section of the article, we will link it back to ML optimization.)

So, here is a question to ponder:

"How to find a square root of any given number without a modern calculator?"

Let's agree that calculating the square or cube of a number is simple enough: just keep multiplying the number by itself n times. Multiplication is easy to do by hand; we have all done it many times in school.

However, when it comes to calculating decimal or fractional powers of a number, our basic multiplication tools disappoint us. Just try calculating values for expressions like these:


2^(1/2) = √2 and 17^(1/3) = ∛17

In my school days, I used to assume that the simplest approach to calculating the square root of 2 would be to run many iterations of squaring decimal numbers between 1 and 2, see which one gives the value closest to 2, and then call that number the square root of 2.

But obviously, this method is quite naïve and inefficient: it requires many iterations with no sense of direction toward the desired result.
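As a quick illustration of this brute-force idea (a hypothetical sketch; the step size and function name below are my own):

```python
def naive_sqrt2(step=0.0001):
    """Square every candidate on a fine grid between 1 and 2 and
    keep the one whose square is closest to 2."""
    best, best_err = 1.0, abs(1.0 * 1.0 - 2)
    x = 1.0
    while x <= 2.0:
        err = abs(x * x - 2)
        if err < best_err:
            best, best_err = x, err
        x += step
    return best

print(naive_sqrt2())  # close to 1.4142
```

Even at a step of 0.0001, this performs roughly ten thousand squarings for only about four decimal places of accuracy, and every extra digit multiplies the work by ten.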

However, there is a more efficient way to solve such problems. It requires only a basic mathematical trick, which Newton thought of. Let's see and understand what he did to solve this.

To keep it very simple, let's stick to the problem of calculating the value of √2.

Rephrasing the Motivation Problem

As a first step, let's call the unknown value x:

x = √2

Now we will perform some well-known mathematical operations on this equation, and you will soon see a pattern emerge.

Squaring both sides and bringing both terms to the left-hand side:

x^2 = 2
x^2 - 2 = 0

Looking at the above equation carefully, we can say that if there is a function f(x) given by

f(x) = x^2 - 2

then we would like to determine the values of x that make f(x) equal to zero. In more mathematical terms, we want to find the "zeros" of this function, which are nothing but the solutions for the square root of 2.

Solution to the Motivation Problem

But how do we find the zeros of this function?

Let's plot f(x):

[Plot of the parabola f(x) = x^2 - 2, crossing the x-axis between -2 and -1 and between 1 and 2]

Looking at the above graph, it is clearly visible that this function has two zeros (i.e., two values of x where f(x) = 0), lying between -2 and -1 and between 1 and 2 respectively. However, we would like an automatic optimization method that can land us on the zeros of any given function.

To solve this problem, here is what Isaac Newton, and later Joseph Raphson, did.

They said: start from any point x, and find the values of f(x) and f'(x), i.e., the first derivative of f(x), at that particular point.

For our case, let's start from the point x = 4. Then f(4) = 4*4 - 2 = 14.

[Plot of f(x) = x^2 - 2 with the point (4, 14) marked]

And the value of the first derivative of f(x) is given by

f'(x) = 2x

i.e., f'(4) = 2*4 = 8.

Before we move ahead in the story, pause for a moment: can you recall the interpretation of this value of the first derivative of f(x)?

Yes, you are right. It is nothing but the instantaneous rate of change of f(x) at the point x; in other words, it is the slope of the tangent line to f(x) at x = 4.

Let's draw this tangent line, which passes through the point (4, 14) with slope 8.

[Plot of f(x) with the tangent line at (4, 14)]


Now, if we can find the equation of this tangent line, it will take us almost all the way to our final solution. And this is simple enough. In coordinate geometry, the equation of a line through a given point (x0, y0) with slope m is

y - y0 = m*(x - x0)

Which means, if we use the above to find the equation of the tangent line that touches f(x) at x = x0, it is given by

y - f(x0) = f'(x0)*(x - x0)

Finally, since this tangent line also crosses the x-axis, i.e., passes through a point where y = 0, let's substitute this new point (x1, 0) into it, and with a little rearranging we reach Newton's equation:

0 - f(x0) = f'(x0)*(x1 - x0)
x1 = x0 - f(x0)/f'(x0)

More generally, it is written as follows:

x_{n+1} = x_n - f(x_n)/f'(x_n)

In our case, the new point x1 is calculated as follows:

x1 = 4 - 14/8 = 4 - 1.75 = 2.25

[Plot showing the tangent line at (4, 14) crossing the x-axis at x1 = 2.25]

Note that this new point is closer to one of the zeros of the function. Let's run one more iteration, taking x0 = 2.25, and see how the graph changes:

[Plot of the tangent line at x = 2.25, whose x-intercept lands even closer to the zero near 1.414]


After multiple iterations, the value of x_n stops changing and converges to a final solution: a point where f(x) = 0 (in practice, where f(x) is sufficiently close to 0).

[Table of successive iterates: 4 → 2.25 → 1.569 → 1.422 → 1.4142]

As you can see, it converges to 1.414..., which is the positive square root of two. If we had started with, say, x0 = -3, it would have converged to -1.414... instead.

This iterative procedure is simple enough to implement in a few lines of Python.
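A minimal sketch of such an implementation (the function and variable names below are my own, not from the original article):

```python
def newton(f, f_prime, x0, tol=1e-10, max_iter=100):
    """Find a zero of f via Newton's update: x_{n+1} = x_n - f(x_n)/f'(x_n)."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / f_prime(x)
        x -= step
        if abs(step) < tol:  # stop once the update is negligible
            break
    return x

# Square root of 2: a zero of f(x) = x^2 - 2, starting from x0 = 4
root = newton(lambda x: x * x - 2, lambda x: 2 * x, x0=4.0)
print(root)  # ≈ 1.41421356...
```

Starting from x0 = -3.0 instead would return the other zero, approximately -1.41421356, matching the behavior described above.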


We can use the same method for other, more complex expressions and get the results. For example, the cube root of 17 is a zero of f(x) = x^3 - 17.

[Plots of Newton's method iterations converging to 17^(1/3) ≈ 2.571]
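The same update rule handles this case; here is a self-contained sketch (again with names of my own choosing):

```python
def newton(f, f_prime, x0, tol=1e-10, max_iter=100):
    """Newton's update: x_{n+1} = x_n - f(x_n)/f'(x_n)."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / f_prime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Cube root of 17: a zero of f(x) = x^3 - 17, with f'(x) = 3x^2
cbrt17 = newton(lambda x: x ** 3 - 17, lambda x: 3 * x ** 2, x0=3.0)
print(cbrt17)  # ≈ 2.5713
```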


Newton's method in ML optimization

Now that we know how Newton's method was designed to find the zeros of any given function, it can also be used, with a small change, to optimize cost functions in modern machine learning algorithms.

For a cost function g(x), we want to find the minima, i.e., the values of x where g(x) is at a minimum rather than zero. Since the derivative is zero at a minimum, the only change we need is to find the zeros of g'(x) instead of g(x).

Applying Newton's update to g'(x), the optimization equation for finding a minimum of the cost function looks like this:

x_{n+1} = x_n - g'(x_n)/g''(x_n)
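To make this concrete, here is a hedged sketch that minimizes a toy cost g(x) = (x - 3)^2 + 1 by applying Newton's update to g' (the example cost and names are illustrative, not from the article):

```python
def newton_minimize(g_prime, g_double_prime, x0, tol=1e-10, max_iter=100):
    """Minimize by finding a zero of g': x_{n+1} = x_n - g'(x_n)/g''(x_n)."""
    x = x0
    for _ in range(max_iter):
        step = g_prime(x) / g_double_prime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# g(x) = (x - 3)^2 + 1  =>  g'(x) = 2*(x - 3),  g''(x) = 2
x_min = newton_minimize(lambda x: 2 * (x - 3), lambda x: 2.0, x0=10.0)
print(x_min)  # 3.0, the minimizer of g
```

Note that for a quadratic cost, Newton's method lands on the minimum in a single step, which hints at why it can converge faster than Gradient Descent near a minimum.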

Similar to Gradient Descent, we can iterate this equation over the parameters to find the minima of cost functions. Note that both Newton's method and Gradient Descent rely on derivative fundamentals and can be used across ML problems; like Gradient Descent, Newton's method converges to a stationary point, which is guaranteed to be the global minimum only for well-behaved (e.g., convex) cost functions.

I hope you found this article useful and learnt something new today. Thank you!

Special Thanks

I would like to thank Luis Serrano, whose Coursera course "Calculus for Machine Learning" helped me understand these fundamentals: https://www.coursera.org/learn/machine-learning-calculus?specialization=mathematics-for-machine-learning-and-data-science



