Multi-Layer Perceptron Learning

Foundation of Deep Learning | Neural Networks

Abstract

Perceptron learning is the foundation of a single-layer network that can act as a classifier for linearly separable problems. However, it fails on non-linear problems such as XOR. Adding one or more extra layers of neurons converts the single-layer network into a multi-layer one, which can handle the non-linearity and classify correctly. This article walks through the introductory mathematical derivations and a lab experiment for solving the XOR problem: with one hidden layer between the input and output layers and a non-linear activation function such as the sigmoid, an MLP can learn the XOR gate function.

1. Introduction

A Multilayer Perceptron neural network has a stacked-layer architecture: the input, hidden, and output layers. The hidden layers are so called because they are connected only to other network units; they are hidden from the "outside world." [1]

There can be any number 'n' of hidden layers, but only a single input layer and a single output layer. Notations such as W_ij, σ(z), and b_j are used in MLPs because the architecture is scalable. Two kinds of signals flow through the network: function (activation) signals flowing forward and error signals flowing backward.

2. Notations used in MLP

For ease of reference and to follow standard usage, we adopt the following notation.

2.1 Layers & Neurons

  • x_i → Inputs to the network, where i indexes the input features.
  • h_j → Neurons in the hidden layer, where j indexes the neuron in a given layer.
  • y_k → Outputs, where k indexes the output neuron.

2.2 Weights W_ij

Weights are the strength of the connection between neurons.

Notation:

  • W_ij → Weight from neuron i (previous layer) to neuron j (next layer).
  • Example: In a hidden layer, W_12 is the weight from the 1st input neuron to the 2nd hidden neuron.

2.3 Bias b_j

  • Bias is a learnable scalar added to the weighted sum before activation, allowing the model to shift the decision boundary (y-intercept).
  • b_j → Bias term for neuron j in a given layer.
  • Example: If a hidden neuron has bias b_2 (i.e., j = 2), its pre-activation becomes: z_2 = W_12 x_1 + W_22 x_2 + b_2 (a quick numeric sketch of this sum follows below).
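As a minimal sketch of that weighted sum in Python, with input, weight, and bias values made up purely for illustration:

```python
import numpy as np

# Hypothetical values, chosen only to illustrate the formula above
x = np.array([1.0, 0.0])          # inputs x_1, x_2
w = np.array([0.5, -0.3])         # weights W_12, W_22 into hidden neuron 2
b_2 = 0.1                         # bias of hidden neuron 2

# Pre-activation of hidden neuron 2: z_2 = W_12*x_1 + W_22*x_2 + b_2
z_2 = np.dot(w, x) + b_2
print(z_2)                        # ≈ 0.6
```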

2.4. Activation Function σ

  • Definition: A function applied to the weighted sum to introduce non-linearity.
  • σ(z) → Activation function applied to the weighted sum z of neuron j.
  • Example: a_j = σ(W_1j x_1 + W_2j x_2 + b_j)
  • Common functions: ReLU, Sigmoid, Tanh, Softmax (a short sketch of sigmoid and ReLU follows below).
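A rough sketch of two of the activations listed above as plain NumPy functions, applied to the pre-activation z from the bias example (the test values are assumptions for illustration):

```python
import numpy as np

def sigmoid(z):
    # Squashes any real value into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Passes positive values through unchanged, zeroes out negatives
    return np.maximum(0.0, z)

z = 0.6                             # pre-activation from the bias example above
print(sigmoid(z))                   # ≈ 0.6457, the hidden activation a_j = σ(z_j)
print(relu(np.array([-1.0, 0.6])))  # [0.  0.6]
```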

2.5. Output Computation

The final output layer applies a function to transform hidden layer activations:

y_k = σ_out ( Σ_j W_jk a_j + b_k ), where k indexes the output neuron and σ_out is the activation function of the output layer.

2.6. Forward Propagation Equation (General Form)

The MLP is a fully connected feedforward neural network, so information is processed in the forward direction; expressed mathematically, this is forward propagation.

For a neuron in any layer l:

z_j^(l) = Σ W_ij^(l) * a_i^(l-1) + b_j^(l)

a_j^(l) = σ(z_j^(l))

where:

  • z_j^(l) → weighted sum (pre-activation) of neuron j in layer l
  • a_j^(l) → output (activation) of neuron j in layer l
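A compact NumPy sketch of one forward-propagation step for an arbitrary layer l, following the two equations above; the layer sizes and random initialization here are illustrative assumptions, not part of any particular network:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_layer(a_prev, W, b):
    """One step of forward propagation for a single layer l.

    a_prev : activations a^(l-1) of the previous layer, shape (n_prev,)
    W      : weight matrix W^(l),                        shape (n_prev, n_curr)
    b      : bias vector b^(l),                          shape (n_curr,)
    """
    z = a_prev @ W + b   # z_j^(l) = sum_i W_ij^(l) a_i^(l-1) + b_j^(l)
    return sigmoid(z)    # a_j^(l) = sigma(z_j^(l))

# Example: 2 inputs feeding 2 hidden neurons (sizes and init are illustrative)
rng = np.random.default_rng(0)
x = np.array([1.0, 0.0])
W_ih = rng.normal(size=(2, 2))
b_h = np.zeros(2)
print(forward_layer(x, W_ih, b_h))   # two hidden activations, each in (0, 1)
```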

3. What about BackPropagation and Weight Updates?

Backpropagation is the process used to update the weights in an MLP by computing the gradient of the loss function with respect to the weights, i.e., how much the loss (error) changes when each weight changes, where the loss measures the difference between the predicted output ŷ and the true output y. This is how the network learns: by minimizing the error.

error = desired (target) output − network output

error = y − ŷ

  • y → target value (ground truth)
  • ŷ → predicted value (the output of the neural network)

4. Handwritten Notes on Derivation of BackPropagation

It is impractical to typeset every step of the derivation here, so the handwritten notes are inserted directly.

Before diving into the core math, the learner should have prior knowledge of the chain rule, partial derivatives, basic differentiation rules (power, sum, product), gradients (slopes), and the sigmoid activation function.
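One small piece of that prerequisite math worth spelling out, since the derivation leans on it repeatedly, is the sigmoid derivative σ'(z) = σ(z)(1 − σ(z)). A quick numerical check of this identity (the test point z = 0.5 is arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 0.5                                                # arbitrary test point
analytic = sigmoid(z) * (1.0 - sigmoid(z))             # σ'(z) = σ(z)(1 − σ(z))
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # central difference
print(np.isclose(analytic, numeric))                   # True
```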

[Handwritten derivation notes (images)]

5. XOR Problem:

A perceptron works by drawing a single straight decision boundary to separate the classes. But the XOR truth table is:

x_1   x_2   x_1 XOR x_2
 0     0        0
 0     1        1
 1     0        1
 1     1        0

If we plot these points on a 2D plane, we see that no single straight line can separate the 1s from the 0s. This is why a perceptron, which relies on a linear decision function, fails to classify XOR correctly.

6. Algorithm: Backpropagation for MLP Neural Network

1. Initialize Parameters: — Randomly initialize weights W_ih, W_ho and biases b_h, b_o.

2. For each training epoch:

1. Forward Propagation:

— Compute the hidden layer input:

h_hidden = X W_ih + b_h

— Apply activation (Sigmoid) to the hidden layer:

a_hidden = σ(h_hidden)

— Compute the output layer input:

h_output = a_hidden W_ho + b_o

— Apply activation (Sigmoid) to the output layer:

ŷ = σ(h_output)

2. Error Calculation:

— Compute error: error = y − ŷ

3. Backpropagation:

— Compute the gradient term at the output layer (element-wise; with error = y − ŷ, this δ carries the negative of the loss gradient):

δ_output = error ⋅ ŷ ⋅ (1 − ŷ)

— Compute the gradient term at the hidden layer (matrix product with W_ho^T, then element-wise):

δ_hidden = (δ_output ⋅ W_ho^T) ⋅ a_hidden ⋅ (1 − a_hidden)

4. Gradient Descent Update: — Update weights and biases (adding η times the terms above, because the δs already carry the negative gradient):

W_ho = W_ho + η ⋅ a_hidden^T ⋅ δ_output

b_o = b_o + η ⋅ Σ δ_output (summed over the training examples)

W_ih = W_ih + η ⋅ X^T ⋅ δ_hidden

b_h = b_h + η ⋅ Σ δ_hidden (summed over the training examples)

3. Repeat until convergence (or the maximum number of epochs is reached). A NumPy sketch of this loop is given below.
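Below is a minimal NumPy sketch of this training loop on the XOR data. It mirrors the steps above; the number of hidden units, the learning rate η, the epoch count, and the random seed are illustrative choices, and the updates add η times the δ terms because, with error = y − ŷ, the δs already hold the negative gradient. With only two hidden units, some random initializations can still get stuck; rerunning with a different seed or a few more hidden units usually resolves it.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR truth table
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(42)
eta, epochs, n_hidden = 0.5, 10000, 2      # illustrative hyperparameters

# 1. Initialize parameters
W_ih = rng.normal(size=(2, n_hidden))
b_h = np.zeros((1, n_hidden))
W_ho = rng.normal(size=(n_hidden, 1))
b_o = np.zeros((1, 1))

for epoch in range(epochs):
    # 2.1 Forward propagation
    a_hidden = sigmoid(X @ W_ih + b_h)
    y_hat = sigmoid(a_hidden @ W_ho + b_o)

    # 2.2 Error
    error = y - y_hat

    # 2.3 Backpropagation (the deltas carry the negative gradient)
    delta_output = error * y_hat * (1 - y_hat)
    delta_hidden = (delta_output @ W_ho.T) * a_hidden * (1 - a_hidden)

    # 2.4 Gradient-descent update
    W_ho += eta * a_hidden.T @ delta_output
    b_o += eta * delta_output.sum(axis=0, keepdims=True)
    W_ih += eta * X.T @ delta_hidden
    b_h += eta * delta_hidden.sum(axis=0, keepdims=True)

print(np.round(y_hat, 3))   # should approach [[0], [1], [1], [0]]
```

Using the full batch of four examples per update matches the matrix form h_hidden = X W_ih + b_h written in the algorithm above.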

The intuition behind how the MLP solves the XOR problem

The hidden layer has neurons that apply weights and biases to the input values and pass the result through an activation function (such as the sigmoid). This non-linear activation allows the network to transform the inputs into a space where a linear separation is possible: the hidden layer maps the inputs into a new feature space (often of higher dimension), enabling non-linear transformations.
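To see this transformation concretely, one can append the following lines to the training sketch in Section 6 and inspect the hidden-layer representation of the four XOR inputs; after successful training they are typically arranged so that a single straight line, drawn by the output neuron, separates the two classes:

```python
# Continues from the training sketch above (reuses sigmoid, X, y, W_ih, b_h)
a_hidden = sigmoid(X @ W_ih + b_h)
for inputs, h, target in zip(X, a_hidden, y):
    print(inputs, "->", np.round(h, 2), "target:", int(target[0]))
```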

Conclusion:

Adding hidden layers to a neural network, specifically a Multi-Layer Perceptron (MLP), enables it to solve problems that are not linearly separable, such as the XOR problem. The key is the non-linearity introduced by the hidden layers and their activation functions, such as the sigmoid, which squashes values into the range (0, 1). These layers allow the network to transform the input space into a representation where complex decision boundaries can be learned. Each backpropagation pass computes the gradient and takes a step that reduces the loss function. Ultimately the weights converge (this is what we call training) to values that correctly classify the XOR inputs.

References:

  1. Bermúdez, J. L. (2014). Cognitive science: An introduction to the science of the mind (2nd ed.). Cambridge University Press.
  2. Haykin, S. (1999). Neural networks: A comprehensive foundation (2nd ed.). Prentice Hall.
  3. Kafley, Sabin. Lecture Notes on Neural Networks. Feb 24, 2025.

Appendix

The implementation of the Multi-Layer Perceptron for the XOR problem is available on Google Colab, where users can access and run the code directly. The full implementation can be found here.
