Multi-Layer Perceptron Learning
Foundation of Deep Learning | Neural Networks
Abstract
Perceptron learning is the foundation of a single-layer network that can act as a classifier for linearly separable problems. However, it fails on non-linear problems such as XOR. Adding one or more extra layers of neurons turns the single-layer network into a multi-layer one, which can handle the non-linearity and classify correctly. This article walks through the introductory mathematical derivations and lab experiments for solving the XOR problem. Using one hidden layer between the input and output layers, with an activation function such as the sigmoid, an MLP can learn the XOR gate function.
1. Introduction
A Multi-Layer Perceptron neural network has a stacked-layer architecture: an input layer, one or more hidden layers, and an output layer. The hidden layers are so called because they connect only to other units of the network; they are hidden from the "outside world." [1]
There can be any number n of hidden layers, but only a single input layer and a single output layer. Notations such as W_ij, σ, and b_j are used in MLP because it is a scalable architecture. Two kinds of signals flow through the network: function signals (activations, propagated forward) and error signals (propagated backward).
2. Notations used in MLP
For ease of exposition and consistency with standard usage, we adopt the following notation.
2.1 Layers & Neurons
2.2 Weights W_ij
Weights represent the strength of the connection between neurons.
Notation: W_ij denotes the weight on the connection from neuron i in one layer to neuron j in the next.
2.3 Bias b_j
The bias b_j shifts the weighted sum of neuron j, allowing the activation threshold to move away from the origin.
2.4. Activation Function σ
The activation function σ introduces non-linearity. This article uses the sigmoid, σ(z) = 1 / (1 + e^(-z)), which maps any real input into the interval (0, 1).
2.5. Output Computation
The output layer applies a function to transform the hidden-layer activations:
y_k = σ_out ( Σ_j W_jk * a_j + b_k ), where k indexes the output-layer neurons and σ_out is the activation function of the output layer.
2.6. Forward Propagation Equation (General Form)
The MLP is a fully connected feedforward neural network: information is processed in the forward direction, which mathematically corresponds to forward propagation.
For a neuron in any layer l:
z_j^(l) = Σ W_ij^(l) * a_i^(l-1) + b_j^(l)
a_j^(l) = σ(z_j^(l))
where:
- z_j^(l) = weighted sum before activation
- a_j^(l) = output after activation
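The two equations above can be sketched in NumPy. This is a minimal illustration; the weights, bias, and input values below are made up for the example, not taken from the article.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def forward_layer(a_prev, W, b):
    """One layer of forward propagation.

    a_prev: (n_samples, n_in) activations a^(l-1) from the previous layer
    W:      (n_in, n_out) weight matrix W^(l)
    b:      (n_out,) bias vector b^(l)
    Returns z (weighted sum before activation) and a (output after activation).
    """
    z = a_prev @ W + b   # z_j^(l) = sum_i W_ij * a_i^(l-1) + b_j
    a = sigmoid(z)       # a_j^(l) = sigma(z_j^(l))
    return z, a

# Tiny example with made-up weights: one sample, two inputs, two neurons
X = np.array([[0.0, 1.0]])
W = np.array([[0.5, -0.3],
              [0.8,  0.2]])
b = np.array([0.1, -0.1])
z, a = forward_layer(X, W, b)
print(z)  # [[0.9 0.1]]
```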
3. What about BackPropagation and Weight Updates?
Backpropagation is the process used to update the weights in an MLP. It computes the gradient of the loss function with respect to the weights, i.e., how much the loss (the difference between the predicted output ŷ and the true output y) changes when each weight changes. Adjusting the weights against this gradient ensures that the network learns by minimizing the error.
error = desired output - actual output
error = y - ŷ
4. Handwritten Notes on Derivation of BackPropagation
I find it impossible to write down all the derivations without directly inserting the handwritten notes.
Before diving into the core mathematics, the learner should have prior knowledge of the chain rule, partial derivatives, basic differentiation rules (power, sum, product), gradients (slopes), and the sigmoid activation function.
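One of those prerequisites is worth spelling out, since it produces the ŷ(1 - ŷ) factors that appear in the backpropagation formulas later on: the sigmoid's derivative has the closed form σ'(z) = σ(z)(1 - σ(z)). A small sketch can check this against a numerical finite-difference derivative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Closed-form derivative: sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Compare against a central finite-difference approximation
z = np.linspace(-4.0, 4.0, 9)
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
print(np.max(np.abs(numeric - sigmoid_prime(z))))  # tiny discrepancy
```

Note the maximum of σ'(z) is 0.25, reached at z = 0; this small derivative is one reason sigmoid networks learn slowly.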
5. XOR Problem:
A perceptron works by drawing a single straight decision boundary to separate the classes. But the XOR truth table is:

x1 | x2 | XOR
0  | 0  | 0
0  | 1  | 1
1  | 0  | 1
1  | 1  | 0

If we plot these points on a 2D plane, you will see that no single straight line can separate the 1s from the 0s. This is why a perceptron, which relies on a linear function, fails to classify XOR correctly.
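As a quick sanity check, we can train a single-layer perceptron (using the classic perceptron learning rule) on the XOR table and watch it fail. The learning rate, seed, and epoch count below are arbitrary illustrative choices:

```python
import numpy as np

# XOR truth table
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# Classic perceptron learning rule with a step activation
rng = np.random.default_rng(0)
w = rng.normal(size=2)
b = 0.0
eta = 0.1
for epoch in range(1000):
    for xi, yi in zip(X, y):
        pred = 1 if xi @ w + b > 0 else 0
        w += eta * (yi - pred) * xi     # update only when the prediction is wrong
        b += eta * (yi - pred)

preds = (X @ w + b > 0).astype(int)
accuracy = (preds == y).mean()
print(accuracy)  # never 1.0: no single line separates XOR
```

No matter how long this runs, a straight line can classify at most three of the four XOR points, so the accuracy never reaches 1.0 and the weights oscillate instead of converging.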
6. Algorithm: Backpropagation for MLP Neural Network
1. Initialize Parameters: — Randomly initialize weights W_ih, W_ho and biases b_h, b_o.
2. For each training epoch:
1. Forward Propagation:
— Compute the hidden layer input:
h_hidden = X W_ih + b_h
— Apply activation (Sigmoid) to the hidden layer:
a_hidden = σ(h_hidden)
— Compute the output layer input:
h_output = a_hidden W_ho + b_o
— Apply activation (Sigmoid) to the output layer:
ŷ = σ(h_output)
2. Error Calculation:
— Compute error: error = y - ŷ
3. Backpropagation:
— Compute the gradient at the output layer:
δ_output = error ⋅ ŷ ⋅ (1 - ŷ)
— Compute the gradient at the hidden layer:
δ_hidden = (δ_output ⋅ W_ho^T) ⋅ a_hidden ⋅ (1 - a_hidden), where the products with a_hidden and (1 - a_hidden) are element-wise
4. Gradient Descent Update: — Update weights and biases (because error = y - ŷ, these terms already point in the direction that reduces the loss, so they are added):
W_ho = W_ho + η ⋅ a_hidden^T ⋅ δ_output
b_o = b_o + η ⋅ δ_output
W_ih = W_ih + η ⋅ X^T ⋅ δ_hidden
b_h = b_h + η ⋅ δ_hidden
3. Repeat until convergence (or max epochs reached).
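The full algorithm above can be sketched as a small NumPy program. This is a minimal illustration, not the Colab implementation mentioned in the appendix; the hidden-layer size of 4, learning rate, seed, and epoch count are arbitrary choices, and because error = y - ŷ the update adds η times the gradient terms:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR truth table: 4 samples, 2 inputs, 1 output
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# 1. Initialize parameters (4 hidden neurons is an arbitrary choice)
rng = np.random.default_rng(42)
W_ih = rng.normal(size=(2, 4))   # input -> hidden weights
b_h = np.zeros(4)
W_ho = rng.normal(size=(4, 1))   # hidden -> output weights
b_o = np.zeros(1)
eta = 0.5                        # learning rate

for epoch in range(20000):
    # 2.1 Forward propagation
    a_hidden = sigmoid(X @ W_ih + b_h)
    y_hat = sigmoid(a_hidden @ W_ho + b_o)

    # 2.2 Error calculation
    error = y - y_hat

    # 2.3 Backpropagation: deltas at the output and hidden layers
    delta_output = error * y_hat * (1 - y_hat)
    delta_hidden = (delta_output @ W_ho.T) * a_hidden * (1 - a_hidden)

    # 2.4 Updates (added, since error = y - y_hat already points downhill)
    W_ho += eta * a_hidden.T @ delta_output
    b_o += eta * delta_output.sum(axis=0)
    W_ih += eta * X.T @ delta_hidden
    b_h += eta * delta_hidden.sum(axis=0)

# Threshold the final sigmoid outputs to get class predictions
preds = (y_hat > 0.5).astype(int)
print(preds.ravel())
```

After training, the thresholded predictions should match the XOR column of the truth table, 0 1 1 0, and the loss should be close to zero.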
The intuition behind how the MLP solves the XOR problem
The hidden layer's neurons apply weights and biases to the input values and pass the result through an activation function (such as the sigmoid). This non-linear activation lets the hidden layer map the inputs into a new, typically higher-dimensional space in which the classes become linearly separable.
Conclusion:
Adding hidden layers to a neural network, specifically a Multi-Layer Perceptron (MLP), enables it to solve problems that are not linearly separable, such as the XOR problem. The key is the non-linearity introduced by the hidden layers' activation functions, such as the sigmoid, whose output lies between 0 and 1. These layers transform the input space into a representation in which complex decision boundaries can be learned. Each backpropagation pass computes the gradient and takes a step that reduces the loss. Ultimately the weights converge (this is what we call training) to values that correctly classify the XOR inputs.
References:
Appendix
The implementation of the Multi-Layer Perceptron for the XOR problem is available on Google Colab, where users can access and run the code directly. The full implementation can be found here.