Vanishing gradient problem simplified!!
Find the original post on Quora
There are already a lot of good answers explaining what a vanishing gradient is mathematically. Still, I would like to share my view of the vanishing gradient through an analogy.
Let’s consider a student (say A) who attends school for the first week and then misses classes due to illness. Since he wants to stay up to date with the lessons being taught at school, he asks his friend (say B), who still goes to school, to keep him updated.
B attends the classes but gets busy in the evenings and cannot pass his knowledge on to A directly. So B transfers his knowledge to his friend C (a mutual friend of A and B) and asks C to teach A.
While updating C, B passes on only 70 percent of the content (having forgotten a few concepts by the time he teaches it). Likewise, when C transfers that knowledge to A, he passes on only 70 percent of what B had transferred to him.
Now, A’s learning = 70% of C’s knowledge = 70% of (70% of B’s knowledge) = 0.7 × 0.7 × (original teaching)
Which makes A’s knowledge 0.7 × 0.7 = 0.49 of what was taught in school.
So A’s loss in learning is 0.51 with just two intermediate friends. Now think about what happens if there are 10 intermediate friends: A’s learning drops to roughly 0.7¹⁰ ≈ 0.028 of what was taught at school.
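If you want to verify the arithmetic, here is a tiny Python sketch. The 70% retention rate is just the made-up number from the story, not a real measurement:

```python
# Each intermediary passes on only a fraction of what they received,
# so the knowledge that reaches A decays exponentially.
retention = 0.7  # fraction each friend passes on (illustrative only)

for n_friends in (2, 10):
    reached = retention ** n_friends
    print(f"{n_friends} intermediaries -> A learns {reached:.3f} of the original")

# Output:
# 2 intermediaries -> A learns 0.490 of the original
# 10 intermediaries -> A learns 0.028 of the original
```

The key point is the exponent: the loss is not additive but multiplicative, so it compounds with every intermediary.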
There is also a video (embedded in the original post) which gives another analogy.
So what do we infer from the story and the video?
The learning that reaches the last person from the first person is drastically reduced (we could almost say there is no learning at all).
This is what the vanishing gradient is all about. The problem occurs in deep networks with many layers: the higher the number of layers, the more the gradients shrink as they are propagated backwards (just as the more people involved, the higher the information loss).
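To see this with actual numbers rather than a story, here is a minimal NumPy sketch of backpropagation through a stack of sigmoid layers. It is my own illustration: the network width, depth, and random weights are arbitrary choices, not anything from the post. Watch how the gradient norm shrinks with each layer it crosses:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_layers, width = 10, 8
weights = [rng.normal(0.0, 1.0, (width, width)) for _ in range(n_layers)]

# Forward pass through a stack of sigmoid layers, keeping activations.
a = rng.normal(0.0, 1.0, width)
activations = []
for W in weights:
    a = sigmoid(W @ a)
    activations.append(a)

# Backward pass: push a gradient from the output back towards the input,
# printing its norm after each layer it crosses.
grad = np.ones(width)
for i in reversed(range(n_layers)):
    s = activations[i]
    # Chain rule: multiply by sigmoid'(z) = s(1 - s) <= 0.25, then by W^T.
    grad = weights[i].T @ (grad * s * (1.0 - s))
    print(f"after layer {i:2d}: |grad| = {np.linalg.norm(grad):.2e}")
```

Because the sigmoid derivative can never exceed 0.25, every layer contributes another small multiplicative factor, exactly like B and C passing on only 70% each.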
But now think of a version of the story where B is able to transfer 99% of what was taught in school to C, and C in turn is able to transfer 99% of what was transferred to him.
So we have to select B and C such that they are efficient at transferring what they have learned.
Our first case, where B and C transfer 70% of the knowledge, can be compared to the sigmoid activation function, while the case where B and C transfer 99% can be compared to the ReLU activation function (note: the numbers are only for illustration and do not correspond to actual transfer rates).
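One hedged way to see why ReLU “transfers” more of the gradient: the sigmoid’s derivative is at most 0.25 everywhere, while ReLU’s derivative is exactly 1 for any positive input. A small sketch (an illustration of the derivatives, not a proof):

```python
import numpy as np

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)          # peaks at 0.25 when x == 0

def relu_grad(x):
    return (x > 0).astype(float)  # exactly 1 for positive inputs, 0 otherwise

x = np.linspace(-4, 4, 9)
print("x          :", x)
print("sigmoid'(x):", np.round(sigmoid_grad(x), 3))
print("relu'(x)   :", relu_grad(x))

# Multiplying ten layers' worth of these local derivatives:
print("ten sigmoid layers at best:", 0.25 ** 10)  # ~9.5e-07
print("ten ReLU layers (active)  :", 1.0 ** 10)   # 1.0
```

So for active (positive) units, ReLU behaves like a friend who passes the lesson along intact, while the sigmoid at best passes on a quarter of it per layer.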
So we can sum up: the vanishing gradient problem occurs in deep neural networks where learning happens by backpropagation.
But if you are now asking why the ReLU activation performs relatively better than the sigmoid activation function, you have to approach it a bit mathematically, which I intend to explain later.
Regards,
Vignesh Kathirkamar