Vanishing gradient problem simplified!!
Find the original post on Quora
There are already a lot of good answers explaining what a vanishing gradient is mathematically. Still, I would like to share my view of the vanishing gradient through an analogy.
Let’s consider a student (say A) who attends school for the first week and then misses classes due to illness. Since he wants to stay up to date with the lessons being taught at school, he asks his friend (say B), who still goes to school, to keep him updated.
B attends the classes but gets busy in the evenings and cannot pass his knowledge on to A directly. So B transfers his knowledge to his friend C (a mutual friend of A and B) and asks C to teach A.
While updating C, B passes on only 70 percent of the content (having forgotten a few concepts by the time he teaches it). Likewise, when C transfers that knowledge to A, he passes on only 70 percent of what B had transferred to him.
Now, A’s learning = 70% of C’s knowledge = 70% of (70% of B’s knowledge) = 0.7 × 0.7 × (original teaching)
Which makes A’s knowledge 0.7 × 0.7 = 0.49 of what was taught in school.
So A’s loss in learning is 0.51 with just two intermediate friends. Now think about what happens if there are 10 intermediate friends: A’s learning drops to roughly 0.7¹⁰ ≈ 0.028 of what was taught at school.
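If you want to verify the arithmetic, here is a tiny Python sketch. The 70% retention rate is just the made-up number from the story, not a real measurement:

```python
# Each intermediary passes on only a fraction of what they received,
# so the knowledge that reaches A decays exponentially.
retention = 0.7  # fraction each friend passes on (illustrative only)

for n_friends in (2, 10):
    reached = retention ** n_friends
    print(f"{n_friends} intermediaries -> A learns {reached:.3f} of the original")

# Output:
# 2 intermediaries -> A learns 0.490 of the original
# 10 intermediaries -> A learns 0.028 of the original
```

The key point is the exponent: the loss is not additive but multiplicative, so it compounds with every intermediary.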
There is also a video (embedded in the original post) which gives another analogy.
So what do we infer from the story and the video?
The learning that reaches the last person from the first person is drastically reduced (we could almost say there is no learning at all).
This is what the vanishing gradient is all about. The problem occurs in deep networks with many layers: the higher the number of layers, the more the gradients shrink as they are propagated backwards (just as the more people involved, the higher the information loss).
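To see this with actual numbers rather than a story, here is a minimal NumPy sketch of backpropagation through a stack of sigmoid layers. It is my own illustration: the network width, depth, and random weights are arbitrary choices, not anything from the post. Watch how the gradient norm shrinks with each layer it crosses:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_layers, width = 10, 8
weights = [rng.normal(0.0, 1.0, (width, width)) for _ in range(n_layers)]

# Forward pass through a stack of sigmoid layers, keeping activations.
a = rng.normal(0.0, 1.0, width)
activations = []
for W in weights:
    a = sigmoid(W @ a)
    activations.append(a)

# Backward pass: push a gradient from the output back towards the input,
# printing its norm after each layer it crosses.
grad = np.ones(width)
for i in reversed(range(n_layers)):
    s = activations[i]
    # Chain rule: multiply by sigmoid'(z) = s(1 - s) <= 0.25, then by W^T.
    grad = weights[i].T @ (grad * s * (1.0 - s))
    print(f"after layer {i:2d}: |grad| = {np.linalg.norm(grad):.2e}")
```

Because the sigmoid derivative can never exceed 0.25, every layer contributes another small multiplicative factor, exactly like B and C passing on only 70% each.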
But now think of a version of the story where B is able to transfer 99% of what was taught in school to C, and C in turn is able to transfer 99% of what was transferred to him.
So we have to select B and C such that they are efficient at transferring what they have learned.
Our first case, where B and C transfer 70% of the knowledge, can be compared to the sigmoid activation function, while the case where B and C transfer 99% can be compared to the ReLU activation function (note: the numbers are only for illustration and do not correspond to actual transfer rates).
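One hedged way to see why ReLU “transfers” more of the gradient: the sigmoid’s derivative is at most 0.25 everywhere, while ReLU’s derivative is exactly 1 for any positive input. A small sketch (an illustration of the derivatives, not a proof):

```python
import numpy as np

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)          # peaks at 0.25 when x == 0

def relu_grad(x):
    return (x > 0).astype(float)  # exactly 1 for positive inputs, 0 otherwise

x = np.linspace(-4, 4, 9)
print("x          :", x)
print("sigmoid'(x):", np.round(sigmoid_grad(x), 3))
print("relu'(x)   :", relu_grad(x))

# Multiplying ten layers' worth of these local derivatives:
print("ten sigmoid layers at best:", 0.25 ** 10)  # ~9.5e-07
print("ten ReLU layers (active)  :", 1.0 ** 10)   # 1.0
```

So for active (positive) units, ReLU behaves like a friend who passes the lesson along intact, while the sigmoid at best passes on a quarter of it per layer.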
So we can sum up: the vanishing gradient problem occurs in deep neural networks where learning happens by backpropagation.
But if you are now asking why the ReLU activation performs relatively better than the sigmoid activation function, you have to approach it a bit mathematically, which I intend to explain later.
Regards,
Vignesh Kathirkamar