#27) Change Comes From Within (within the chain rule, that is...)

My beloved students, today could easily become the Greatest Day Of Your Life as we continue our sojourn in Section 5, exploring how our AI code synchs up with the math of the Chain Rule. In a burst of artistic creativity, I have (cleverly) decided to entitle this chapter, "How the Code Synchs Up with the Math." I know, right? I came up with it all by myself:

5.5) How the Code Synchs Up with the Math:

[Image: lines 66 to 115 of our Python code, with red lines connecting them to our ratios of change]

5.5.a) Removing the Intermediary Variables

My dear students, you will notice with great joy that lines 66 to 115 of our Python code appear above, with (fashionable) red lines connecting them to our ratios of change. If those connections don't look consistent to you, that is only because our original code breaks the back propagation process down into several intermediary steps with several extra, intermediary variables. Below I want you to take a look at the code after I remove these four intermediary variables:

  1. l2_error;
  2. l2_delta;
  3. l1_error; and
  4. l1_delta

Start by studying the top pieces of code (with red arrows attached). Pretend the four intermediary variables have disappeared. What does that leave? Now take a look at the bottom line of the diagram below, with the green arrows pointing upward. You'll see that the bottom line of code with green arrows is what remains when we remove the intermediary variables from the red-arrowed chunks of code. All three rows now synch up perfectly. Below, I'll explain what I mean:
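The original lines 66 to 115 aren't reproduced here, but a minimal sketch in the spirit of that code (the shapes, random weights, and input values are my own illustrative assumptions, not the article's actual numbers) shows what "removing the intermediaries" means: the four named variables are just stepping stones, and substituting them away produces a single line that computes the identical weight update.

```python
import numpy as np

def slope(x):
    # Slope of the sigmoid, written in terms of its output:
    # if x = sigmoid(z), then d(sigmoid)/dz = x * (1 - x).
    return x * (1 - x)

sigmoid = lambda z: 1 / (1 + np.exp(-z))

# Hypothetical tiny network state, for illustration only.
l0 = np.array([[1.0, 0.0, 1.0]])   # input layer (customer answers)
syn0 = np.random.randn(3, 4)       # weights, layer 0 -> layer 1
syn1 = np.random.randn(4, 1)       # weights, layer 1 -> layer 2
y = np.array([[1.0]])              # the "right answer"

l1 = sigmoid(l0 @ syn0)            # hidden layer
l2 = sigmoid(l1 @ syn1)            # prediction

# Back propagation WITH the four intermediary variables:
l2_error = y - l2
l2_delta = l2_error * slope(l2)
l1_error = l2_delta @ syn1.T
l1_delta = l1_error * slope(l1)
syn1_update = l1.T @ l2_delta
syn0_update = l0.T @ l1_delta

# The same updates WITHOUT the intermediaries (the chain rule in one line):
syn1_update_direct = l1.T @ ((y - l2) * slope(l2))
syn0_update_direct = l0.T @ ((((y - l2) * slope(l2)) @ syn1.T) * slope(l1))

assert np.allclose(syn1_update, syn1_update_direct)
assert np.allclose(syn0_update, syn0_update_direct)
```

The intermediaries exist only for readability; mathematically, the one-line versions and the step-by-step versions are the same expression.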

[Image: the same chunks of code with red arrows on top, and on the bottom line, marked with green arrows, the code that remains once the intermediary variables are removed]

5.5.b) How the Code Calculates the Same Variable that Each Ratio of Change Calculates

Mis amigos, you may recall that, with confidence measures, when we took the slope of our statistical probability (that number between 0 and 1), we simply computed rise over run using the function x(1 - x): x is the rise, 1 - x is the run.
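A quick numerical sanity check (the value of z below is my own arbitrary choice) confirms that x(1 - x) really is the rise-over-run slope of the sigmoid at its output x:

```python
import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))

z = 0.3                      # arbitrary input, chosen for illustration
x = sigmoid(z)               # our statistical probability, between 0 and 1

analytic = x * (1 - x)       # the x(1 - x) slope formula

# Rise over run, measured directly with a tiny nudge:
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)

assert abs(analytic - numeric) < 1e-9
```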

But here, we're computing something different, and more accurate. We want to know how much a given CHANGE in syn0,1 will cause a CHANGE in l2_error. It's like saying, "A change in the run, which is syn0,1, will cause how much of a change in the rise, which is l2_error?" So, in order to figure out the rate of change in d l2_error/d syn0,1 we are going to break that big ratio of change down into 5 little parts, 5 little ratios of change, 5 little cases of "this change in run causes this much change in rise." In other words, we're going to examine each link of the chain rule separately.
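Written out in one line, using the article's own variable names, the five links multiply together like this:

d l2_error/d syn0,1 = (d l2_error/d l2) x (d l2/d l2_LH) x (d l2_LH/d l1) x (d l1/d l1_LH) x (d l1_LH/d syn0,1)

Each factor on the right is one link of the chain, and the product of all five gives the big ratio on the left.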

Let's walk through each of these ratios of change, from right-to-left, to make sure you understand how the code synchs up with the math:

For d l1_LH / d syn0,1: We know that l0,1 is 1 (the "yes" answer of customer one to, "Do you own a cat?"). Therefore, the ratio of change will always be 1, because no matter what value you give syn0,1, l1_LH will always be that value times l0,1, which is one. This makes sense because l1_LH divided by syn0 will always equal l0.
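You can see this first link directly in code (the sample weight values below are arbitrary, chosen just to show the ratio doesn't depend on them):

```python
# l1_LH is just syn0,1 times the fixed input l0,1, so its slope with
# respect to the weight is l0,1 itself (here 1, the "yes" to the cat question).
l0_1 = 1.0

def l1_LH(syn0_1):
    return syn0_1 * l0_1

h = 1e-6
for syn0_1 in (0.1, 0.5, 2.0):           # any weight value you like
    rise_over_run = (l1_LH(syn0_1 + h) - l1_LH(syn0_1)) / h
    assert abs(rise_over_run - l0_1) < 1e-6   # always equals l0,1, i.e. 1
```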

For d l1 / d l1_LH: In the top code, when you remove the intermediary variables l1_delta and l1_error, you are left with only finding the slope of l1. This is exactly the same as when we took the confidence measure of l1 in Section 4.5 of 5. So, d l1 divided by d l1_LH will always equal the slope of l1.

For d l2_LH / d l1: In the top code, when you remove the intermediary variables l1_error and l2_delta, you are left with only syn1,1. This makes sense, because d l2_LH divided by d l1 will always equal syn1.

For d l2 / d l2_LH: In the top code, when you remove the intermediary variables l2_delta and l2_error, you are left with only the slope of l2. This makes sense, because d l2 divided by d l2_LH will always equal the slope of l2.

For d l2_error / d l2: The "-1" in the bottom green code makes sense because the relationship is a negative correlation. For example, take y - l2 = l2_error, and for our first customer 1 - 0.5 = 0.5. If you increase l2 by 0.1, then 1 - 0.6 = 0.4. In other words, l2_error decreased by 0.1 when we increased l2 by 0.1. That's a 1-to-1 negative correlation.
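To tie the five links together, here is a sketch that follows one scalar path through the network (the weight and input values are my own illustrative assumptions). It computes each of the five little ratios, multiplies them, and checks the product against a direct rise-over-run measurement of the whole chain:

```python
import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))

# Hypothetical scalar slice of the network: one input, one weight per layer.
l0_1, syn1_1, y = 1.0, 0.7, 1.0
syn0_1 = 0.4                             # the weight we're nudging

def forward(w):
    l1_LH = w * l0_1
    l1 = sigmoid(l1_LH)
    l2_LH = l1 * syn1_1
    l2 = sigmoid(l2_LH)
    return l1, l2, y - l2                # last item is l2_error

l1, l2, l2_error = forward(syn0_1)

# The five little ratios of change, right to left:
r1 = l0_1                                # d l1_LH / d syn0,1
r2 = l1 * (1 - l1)                       # d l1 / d l1_LH  (slope of l1)
r3 = syn1_1                              # d l2_LH / d l1
r4 = l2 * (1 - l2)                       # d l2 / d l2_LH  (slope of l2)
r5 = -1.0                                # d l2_error / d l2 (negative correlation)

chain = r1 * r2 * r3 * r4 * r5

# Direct rise-over-run on the whole chain, for comparison:
h = 1e-6
numeric = (forward(syn0_1 + h)[2] - forward(syn0_1 - h)[2]) / (2 * h)

assert abs(chain - numeric) < 1e-9
```

The product of the five little ratios agrees with the directly measured slope, which is exactly the claim of the chain rule.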

It's key that you understand that all the slopes we are calculating above are being evaluated at the CURRENT STATE OF THE NETWORK (i.e. with weights fixed at the values used in feed forward). To use our juggler's analogy, it's like he can magically stop time for a moment and take a snapshot of all 16 pins in the air at that moment. This is like our prediction at the end of one forward feed. Since one bowling pin has changed in size, he now magically adjusts the sizes of the other 15 pins, still in mid-air, before he magically starts time moving again (i.e., the next iteration).

I hope you are beginning to see the amazing power of the chain rule to juggle all the weights of a neural network while adjusting them relative to each other. The chain rule is the guts of back prop, which is the guts of gradient descent.

Again: our goal is to calculate these 5 ratios and multiply them together in order to find the ultimate ratio of how much a change in our butterfly, syn0,1, creates the change we want in our hurricane, the l2_error. How do we calculate those ratios? Next, let's take one example of one weight and walk through all the steps of the math of the chain rule.

OK, I think that's enough material to keep you from bingeing on TV or ice cream for today, so study hard (and re-read this stuff about 800 times), and remember to find the joy even in the tiny moments, like this stunning travel photo of a street vendor of tea in Cairo (I LOVE those plastic flowers on top of his teapot!).

[Photo: a street vendor of tea in Cairo, plastic flowers on top of his teapot]
