Multi-Layer Perceptron Learning

Foundation of Deep Learning | Neural Networks

Abstract

Perceptron learning is the foundation of a single-layer network that can act as a classifier for linearly separable problems. However, it fails on non-linear problems such as XOR. Adding one or more extra layers of neurons converts the single-layer network into a multi-layer one, which can handle the non-linearity and classify correctly. This article walks through the introductory mathematical derivations and a lab experiment for solving the XOR problem: with one hidden layer between the input and output layers and a non-linear activation function such as the sigmoid, an MLP can learn the XOR gate function.

1. Introduction

A Multilayer Perceptron neural network has a stacked-layer architecture: the input, hidden, and output layers. The hidden layers are so called because they are connected only to other network units; they are hidden from the "outside world." [1]

There can be any number 'n' of hidden layers, but only a single input layer and a single output layer. Notations such as W_ij, σ(z), and b_j are used in MLPs because the architecture is scalable. Two kinds of signals flow through the network: function (activation) signals flowing forward and error signals flowing backward.

2. Notations used in MLP

For ease of reference and to follow standard usage, we adopt the following notation.

2.1 Layers & Neurons

  • x_i → Inputs to the network, where i indexes the input features.
  • h_j → Neurons in the hidden layer, where j indexes the neuron in a given layer.
  • y_k → Outputs, where k indexes the output neuron.

2.2 Weights W_ij

Weights are the strength of the connection between neurons.

Notation:

  • W_ij → Weight from neuron i (previous layer) to neuron j (next layer).
  • Example: In a hidden layer, W_12 is the weight from the 1st input neuron to the 2nd hidden neuron.

2.3 Bias b_j

  • Bias is a learnable scalar added to the weighted sum before activation, allowing the model to shift the decision boundary (y-intercept).
  • b_j → Bias term for neuron j in a given layer.
  • Example: If a hidden neuron has bias b_2 (i.e., j = 2), its pre-activation becomes: z_2 = W_12 x_1 + W_22 x_2 + b_2 (a quick numeric sketch of this sum follows below).
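As a minimal sketch of that weighted sum in Python, with input, weight, and bias values made up purely for illustration:

```python
import numpy as np

# Hypothetical values, chosen only to illustrate the formula above
x = np.array([1.0, 0.0])          # inputs x_1, x_2
w = np.array([0.5, -0.3])         # weights W_12, W_22 into hidden neuron 2
b_2 = 0.1                         # bias of hidden neuron 2

# Pre-activation of hidden neuron 2: z_2 = W_12*x_1 + W_22*x_2 + b_2
z_2 = np.dot(w, x) + b_2
print(z_2)                        # ≈ 0.6
```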

2.4. Activation Function σ

  • Definition: A function applied to the weighted sum to introduce non-linearity.
  • σ(z) → Activation function applied to the weighted sum z of neuron j.
  • Example: a_j = σ(W_1j x_1 + W_2j x_2 + b_j)
  • Common functions: ReLU, Sigmoid, Tanh, Softmax (a short sketch of sigmoid and ReLU follows below).
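A rough sketch of two of the activations listed above as plain NumPy functions, applied to the pre-activation z from the bias example (the test values are assumptions for illustration):

```python
import numpy as np

def sigmoid(z):
    # Squashes any real value into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Passes positive values through unchanged, zeroes out negatives
    return np.maximum(0.0, z)

z = 0.6                             # pre-activation from the bias example above
print(sigmoid(z))                   # ≈ 0.6457, the hidden activation a_j = σ(z_j)
print(relu(np.array([-1.0, 0.6])))  # [0.  0.6]
```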

2.5. Output Computation

The final output layer applies a function to transform hidden layer activations:

y_k = σ_out ( Σ_j W_jk a_j + b_k ), where k indexes the output neuron and σ_out is the activation function of the output layer.

2.6. Forward Propagation Equation (General Form)

The MLP is a fully connected feedforward neural network, so information is processed in the forward direction; expressed mathematically, this is forward propagation.

For a neuron in any layer l:

z_j^(l) = Σ W_ij^(l) * a_i^(l-1) + b_j^(l)

a_j^(l) = σ(z_j^(l))

where:

  • z_j^(l) → weighted sum (pre-activation) of neuron j in layer l
  • a_j^(l) → output (activation) of neuron j in layer l
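A compact NumPy sketch of one forward-propagation step for an arbitrary layer l, following the two equations above; the layer sizes and random initialization here are illustrative assumptions, not part of any particular network:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_layer(a_prev, W, b):
    """One step of forward propagation for a single layer l.

    a_prev : activations a^(l-1) of the previous layer, shape (n_prev,)
    W      : weight matrix W^(l),                        shape (n_prev, n_curr)
    b      : bias vector b^(l),                          shape (n_curr,)
    """
    z = a_prev @ W + b   # z_j^(l) = sum_i W_ij^(l) a_i^(l-1) + b_j^(l)
    return sigmoid(z)    # a_j^(l) = sigma(z_j^(l))

# Example: 2 inputs feeding 2 hidden neurons (sizes and init are illustrative)
rng = np.random.default_rng(0)
x = np.array([1.0, 0.0])
W_ih = rng.normal(size=(2, 2))
b_h = np.zeros(2)
print(forward_layer(x, W_ih, b_h))   # two hidden activations, each in (0, 1)
```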

3. What about BackPropagation and Weight Updates?

Backpropagation is the process used to update the weights in an MLP by computing the gradient of the loss function with respect to the weights, i.e., how much the loss (error) changes when each weight changes, where the loss measures the difference between the predicted output ŷ and the true output y. This is how the network learns: by minimizing the error.

error = desired (target) output − network output

error = y − ŷ

  • y → target value (ground truth)
  • ŷ → predicted value (the output of the neural network)

4. Handwritten Notes on Derivation of BackPropagation

It is impractical to typeset every step of the derivation here, so the handwritten notes are inserted directly.

Before diving into the core math, the learner should have prior knowledge of the chain rule, partial derivatives, basic differentiation rules (power, sum, product), gradients (slopes), and the sigmoid activation function.
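One small piece of that prerequisite math worth spelling out, since the derivation leans on it repeatedly, is the sigmoid derivative σ'(z) = σ(z)(1 − σ(z)). A quick numerical check of this identity (the test point z = 0.5 is arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 0.5                                                # arbitrary test point
analytic = sigmoid(z) * (1.0 - sigmoid(z))             # σ'(z) = σ(z)(1 − σ(z))
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # central difference
print(np.isclose(analytic, numeric))                   # True
```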

[Handwritten derivation notes (images)]

5. XOR Problem:

A perceptron works by drawing a single straight decision boundary to separate the classes. But the XOR truth table is:

x_1   x_2   x_1 XOR x_2
 0     0        0
 0     1        1
 1     0        1
 1     1        0

If we plot these points on a 2D plane, we see that no single straight line can separate the 1s from the 0s. This is why a perceptron, which relies on a linear decision function, fails to classify XOR correctly.

6. Algorithm: Backpropagation for MLP Neural Network

1. Initialize Parameters: — Randomly initialize weights W_ih, W_ho and biases b_h, b_o.

2. For each training epoch:

1. Forward Propagation:

— Compute the hidden layer input:

h_hidden = X W_ih + b_h

— Apply activation (Sigmoid) to the hidden layer:

a_hidden = σ(h_hidden)

— Compute the output layer input:

h_output = a_hidden W_ho + b_o

— Apply activation (Sigmoid) to the output layer:

ŷ = σ(h_output)

2. Error Calculation:

— Compute error: error = y − ŷ

3. Backpropagation:

— Compute the gradient term at the output layer (element-wise; with error = y − ŷ, this δ carries the negative of the loss gradient):

δ_output = error ⋅ ŷ ⋅ (1 − ŷ)

— Compute the gradient term at the hidden layer (matrix product with W_ho^T, then element-wise):

δ_hidden = (δ_output ⋅ W_ho^T) ⋅ a_hidden ⋅ (1 − a_hidden)

4. Gradient Descent Update: — Update weights and biases (adding η times the terms above, because the δs already carry the negative gradient):

W_ho = W_ho + η ⋅ a_hidden^T ⋅ δ_output

b_o = b_o + η ⋅ Σ δ_output (summed over the training examples)

W_ih = W_ih + η ⋅ X^T ⋅ δ_hidden

b_h = b_h + η ⋅ Σ δ_hidden (summed over the training examples)

3. Repeat until convergence (or the maximum number of epochs is reached). A NumPy sketch of this loop is given below.
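Below is a minimal NumPy sketch of this training loop on the XOR data. It mirrors the steps above; the number of hidden units, the learning rate η, the epoch count, and the random seed are illustrative choices, and the updates add η times the δ terms because, with error = y − ŷ, the δs already hold the negative gradient. With only two hidden units, some random initializations can still get stuck; rerunning with a different seed or a few more hidden units usually resolves it.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR truth table
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(42)
eta, epochs, n_hidden = 0.5, 10000, 2      # illustrative hyperparameters

# 1. Initialize parameters
W_ih = rng.normal(size=(2, n_hidden))
b_h = np.zeros((1, n_hidden))
W_ho = rng.normal(size=(n_hidden, 1))
b_o = np.zeros((1, 1))

for epoch in range(epochs):
    # 2.1 Forward propagation
    a_hidden = sigmoid(X @ W_ih + b_h)
    y_hat = sigmoid(a_hidden @ W_ho + b_o)

    # 2.2 Error
    error = y - y_hat

    # 2.3 Backpropagation (the deltas carry the negative gradient)
    delta_output = error * y_hat * (1 - y_hat)
    delta_hidden = (delta_output @ W_ho.T) * a_hidden * (1 - a_hidden)

    # 2.4 Gradient-descent update
    W_ho += eta * a_hidden.T @ delta_output
    b_o += eta * delta_output.sum(axis=0, keepdims=True)
    W_ih += eta * X.T @ delta_hidden
    b_h += eta * delta_hidden.sum(axis=0, keepdims=True)

print(np.round(y_hat, 3))   # should approach [[0], [1], [1], [0]]
```

Using the full batch of four examples per update matches the matrix form h_hidden = X W_ih + b_h written in the algorithm above.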

The intuition behind how the MLP solves the XOR problem

The hidden layer has neurons that apply weights and biases to the input values and pass the result through an activation function (such as the sigmoid). This non-linear activation allows the network to transform the inputs into a space where a linear separation is possible: the hidden layer maps the inputs into a new feature space (often of higher dimension), enabling non-linear transformations.
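To see this transformation concretely, one can append the following lines to the training sketch in Section 6 and inspect the hidden-layer representation of the four XOR inputs; after successful training they are typically arranged so that a single straight line, drawn by the output neuron, separates the two classes:

```python
# Continues from the training sketch above (reuses sigmoid, X, y, W_ih, b_h)
a_hidden = sigmoid(X @ W_ih + b_h)
for inputs, h, target in zip(X, a_hidden, y):
    print(inputs, "->", np.round(h, 2), "target:", int(target[0]))
```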

Conclusion:

Adding hidden layers to a neural network, specifically a Multi-Layer Perceptron (MLP), enables it to solve problems that are not linearly separable, such as the XOR problem. The key is the non-linearity introduced by the hidden layers and their activation functions, such as the sigmoid, which squashes values into the range (0, 1). These layers allow the network to transform the input space into a representation where complex decision boundaries can be learned. Each backpropagation pass computes the gradient and takes a step that reduces the loss function. Ultimately the weights converge (this is what we call training) to values that correctly classify the XOR inputs.

References:

  1. Bermúdez, J. L. (2014). Cognitive science: An introduction to the science of the mind (2nd ed.). Cambridge University Press.
  2. Haykin, S. (1999). Neural networks: A comprehensive foundation (2nd ed.). Prentice Hall.
  3. Kafley, Sabin. Lecture Notes on Neural Networks. Feb 24, 2025.

Appendix

The implementation of the Multi-Layer Perceptron for the XOR problem is available on Google Colab, where users can access and run the code directly. The full implementation can be found here.
