Vanishing Gradient Problem

Vanishing gradient problem simplified!

Find the original post on Quora

There are already a lot of good answers explaining what the vanishing gradient is mathematically. Still, I would like to explain my view of the vanishing gradient with an analogy.

Let’s consider a student (say A) who attends school for the first week and then misses classes due to illness. Since he wants to stay up to date with the lessons being taught at school, he asks his friend (say B), who still goes to school, to keep him updated.

B, who attends the classes, gets busy in the evenings and can’t transfer his knowledge to A directly. So B transfers his knowledge to his friend C (a mutual friend of A and B) and asks him to teach A.

While updating C, B passes on only 70 percent of the content (as he had forgotten a few concepts by the time of teaching). When C in turn transfers that knowledge to A, he likewise passes on only 70 percent of what B had transferred to him.

Now, A’s learning = 70% of C’s knowledge = 70% of (70% of B’s knowledge) = 70% of (70% of the original teaching)

Which makes A’s knowledge 0.7 × 0.7 = 0.49 of what was taught in school.

So A’s loss in learning is 0.51 with just two intermediate friends. Think of what happens if there are 10 intermediate friends: A’s learning shrinks to 0.7¹⁰, which is roughly 0.028 of what was taught at school.
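The compounding above is easy to check in a few lines of Python. This is just a sketch of the analogy (the function name and numbers are mine, for illustration): each intermediate friend passes on only a fixed fraction of what he receives, so the fractions multiply.

```python
def final_learning(transfer_rate, n_intermediaries):
    """Fraction of the original lesson that finally reaches A,
    after passing through n_intermediaries friends who each
    retain only transfer_rate of what they were told."""
    return transfer_rate ** n_intermediaries

print(final_learning(0.7, 2))   # two friends: 0.49
print(final_learning(0.7, 10))  # ten friends: ~0.028
```

The signal decays exponentially in the chain length, which is exactly why a few extra hops make the final learning almost zero.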

Below is a video with another analogy:

https://www.youtube.com/watch?v=D-YHC8b6Hjk

So what do we infer from the story and the video?

The learning that reaches the last person from the first person is drastically reduced (or, we could say, there is almost no learning at all).

This is what the vanishing gradient is all about. The problem occurs in deep networks with many layers: the more layers, the greater the loss (the gradients vanish), just as the more people involved, the greater the information loss.

But think of a scenario in the story where B is able to transfer 99% of what was taught at school to C, and C in turn is able to transfer 99% of what was transferred to him.

So we have to select B and C such that they are efficient at transferring what they have learned.

Our first case, where B and C transfer 70% of the knowledge, can be compared with the sigmoid activation function, and the case where they transfer 99% can be compared with the ReLU activation function (note: these magnitudes are only for illustration and don’t correspond to actual transfer rates).
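To make the comparison a little more concrete (this is my own hedged illustration, not exact transfer rates): during backpropagation the gradient reaching an early layer is roughly a product of per-layer derivative factors. The sigmoid’s derivative, s(x)·(1 − s(x)), is at most 0.25, while the ReLU’s derivative is exactly 1 for any positive input, so the two products behave very differently over many layers.

```python
import math

def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x)), at most 0.25."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive input, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

layers = 10
# Even at x = 0, where the sigmoid's slope is largest (0.25),
# ten layers multiply down to ~1e-6 -- the gradient vanishes.
print(sigmoid_grad(0.0) ** layers)
# ReLU passes the gradient through unchanged for positive inputs.
print(relu_grad(1.0) ** layers)
```

This is the best case for the sigmoid; away from zero its derivative is even smaller, so real gradients shrink faster still.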

To sum up: the vanishing gradient problem occurs in deep neural networks, where learning happens through backpropagation.

But if you are wondering why the ReLU activation performs better than the sigmoid activation function, you have to approach it a bit mathematically, which I intend to explain later.

Related Topic: What’s the difference between gradient descent and stochastic gradient descent?

Regards,

Vignesh Kathirkamar

AI Pylinux
