Machine Learning 101 - Perceptrons and the XOR Problem
Despite their very science-fiction-sounding name, perceptrons - the building blocks of neural networks - are super simple to understand. This blog will describe how a perceptron is created and trained, looking at the simple mathematics that can be extended to solve a massive array of problems.
The initial problem
Imagine you work at a factory that produces pens - the sort of ballpoint pens with lids. The pens leave the main manufacturing floor in two pieces, in little boxes on a conveyor belt - on one side of each box the pen bodies, on the other the lids.
At this factory you are in charge of quality control - as the two components of the pens flow past your station, you are to make sure there is both a pen body and a lid in each box. On your workstation is a little button, which you are to push when a box missing either or both components passes by. Pressing this button shuts down the conveyor belt line and notifies someone to investigate.
First automation
After working away for a while, you decide your job is rather boring and ripe for automation. During your lunch break one day, you decide to install two sensors - one which watches for the pen bodies, the other monitoring the lids. If the sensor monitoring the lids notices the lid is missing from the box, it sends a signal of 0; otherwise, it sends a signal of 1. The same goes for the sensor watching for bodies. Once both sensors have reported a 1, the box is deemed valid, and the conveyor belt continues to the next box.
This relationship is essentially an AND relationship - when both inputs are true, the output is true; otherwise, the output is false. If an absence is noticed by either device, the alarm can be set off immediately, as there is no possible case where a box with a missing item can be valid, so we need not wait for the other sensor to report back.
The cool thing about this is that the relationship is linear - plot the four possible sensor readings on a graph, one sensor per axis, and you can draw a single straight line with the valid case (1,1) on one side and the three invalid cases on the other. Any situation falling on one side of the line is okay, anything on the other is not - this line is known as a decision boundary.
A perceptron is one of the earliest, and most basic, of machine learning algorithms, usually drawn as a row of inputs, F1 through Fn, each connected by a weight, W1 through Wn, to a single node that sums them and applies a threshold.
The F values refer to the features of the decision - in our case there would be two: the presence of a pen body (0 or 1 for true or false), and the presence of a pen lid (again, 0 or 1). The W values are weights - adjustable parameters which are tweaked as part of the learning process. When making a decision the perceptron multiplies each feature by its corresponding weight, then adds them all up. If the result is at or above a certain value (e.g. 0), then the perceptron will output true, otherwise, it will output false.
\[ f(\mathbf{x}) = \begin{cases} 1 & \text{if } \mathbf{w} \cdot \mathbf{x} + b \ge 0 \\ 0 & \text{otherwise} \end{cases} \]
This equation (thanks to the magic of LaTeX and CodeCogs, by the way - I wish I had bothered learning LaTeX ages ago) is essentially a fancy way of saying: multiply each input by its corresponding weight, add the results together, add in a bias value, and if the total is at or above 0, return true, otherwise return false (you may also be able to see how this gives us the equation of a straight line - with the weights rotating the line about, and the bias shifting it over).
In our example, our sensors return a signal of 1 if their corresponding item is present in the current box. This means that for a valid box our inputs will be [1,1] (i.e. a 1 for each sensor). For a box with one item but not the other we will see [0,1] or [1,0], and a box with no items will give us [0,0]. There are an infinite number of candidate weights we can apply to our perceptron in order to get the correct output, but the simplest is [0.5,0.5], with a bias of -1 (in this example we will take any value greater than or equal to 0 as 'activating' the perceptron).
Here we can see the positive example (that is, a case where both the lid and the body were detected) moving through the perceptron. Our input vector is [1,1], our weights were set to [0.5,0.5], our bias is set to -1, and the perceptron will 'activate', or return a positive value, when the dot product of the input vector and the weight vector plus the bias is greater than or equal to 0. In our example the perceptron activates:
\[ (1 \times 0.5) + (1 \times 0.5) + (-1) = 0 \ge 0 \]
It should be relatively intuitive that the perceptron will correctly not activate in the other 'invalid box' cases, as the values will not overcome the perceptron's bias. E.g. for a box with a body but no lid:
\[ (1 \times 0.5) + (0 \times 0.5) + (-1) = -0.5 \stackrel{?}{\ge} 0 \]
That final condition will not evaluate to true, therefore the perceptron remains off.
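If you'd like to follow along in code, here is a minimal sketch of that forward pass in Python (the function and variable names are my own, for illustration only):

```python
# A minimal perceptron forward pass: the dot product of the inputs and
# the weights, plus the bias, compared against an activation threshold of 0.
def perceptron(inputs, weights, bias):
    total = sum(i * w for i, w in zip(inputs, weights)) + bias
    return 1 if total >= 0 else 0

weights = [0.5, 0.5]
bias = -1

print(perceptron([1, 1], weights, bias))  # 1 - both items present, valid box
print(perceptron([1, 0], weights, bias))  # 0 - body present, lid missing
print(perceptron([0, 0], weights, bias))  # 0 - empty box
```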
Perceptron Learning Worked Example
So our perceptron is able to correctly classify the boxes on our conveyor belt! However, this was a very simple example with only two equally weighted features. There are any number of situations with lots of features, each of potentially different importance (or no importance at all!) - if we were to classify a problem with 10, 20 or 100 features using a perceptron, it would get very tedious trying to find all the correct feature weights by hand! Luckily this is machine learning - we can get our algorithm to learn the weights for us, provided we're lucky enough to have a full dataset with relevant labels (labels here refer to the desired output of the algorithm). In our pen example the feature data would be a scenario the perceptron might see - like a pen lid being present and the body being absent. The label would be the desired output - in this case, a non-activation, or 0. So let's start with a labelled dataset:

Inputs    Label
[0, 0]    0
[1, 0]    0
[0, 1]    0
[1, 1]    1
Next, we'll create our perceptron, initialising the weights to small random values - we'll use [0.1, 0.3] - and, for the sake of illustration, setting the bias to 2.
Now that's all set up, we'll run each of our training examples through the perceptron, adjusting the weights as we go. If the perceptron gets the answer correct, great! Move on to the next one. If not, then each weight is adjusted according to the perceptron learning rule:
\[ w_i \leftarrow w_i + \eta \, (t - o) \, x_i \]
Here \(\eta\) is the learning rate, \(t\) is the target output (the label), \(o\) is the output the perceptron actually produced, and \(x_i\) is the input corresponding to weight \(w_i\).
A learning rate is just a small number used to slow down the rate of adjustment, preventing us from overshooting our mark (I like to picture this as controlling the 'resolution' of the decision boundary). We'll use 0.1 as our learning rate.
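As a rough sketch of that rule in Python (a hypothetical helper of my own, not from any library):

```python
LEARNING_RATE = 0.1

# Perceptron learning rule for a single weight: scale the error (target
# minus actual output) by the learning rate and that weight's input.
def weight_update(weight, x, target, output):
    return weight + LEARNING_RATE * (target - output) * x

# A wrongly-activating perceptron (output 1, target 0) leaves a weight
# untouched when its input was 0...
print(weight_update(0.1, 0, target=0, output=1))  # 0.1
# ...but nudges it downwards when its input was 1.
print(weight_update(0.1, 1, target=0, output=1))  # 0.0
```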
We're now ready to begin training our perceptron! You would run the evaluation on each sample, repeating the process until convergence is reached (or giving up after a certain number of iterations - more on this later). Let's work through the first iteration together, beginning with the first sample: inputs [0, 0], label 0.
So we've fed [0,0] into our perceptron, and we want to see a non-activation. You can probably see that due to the large bias value we're not going to get the result we're hoping for, but let's work it through anyway:
\[ (0 \times 0.1) + (0 \times 0.3) + 2 = 2 \ge 0 \]
Multiplying the inputs by the weights and adding the bias gives us 2, which we then compare to the activation threshold, 0 - so our perceptron activates, returning 1. However, we were looking for a 0, so we need to apply the learning rule to the weights, starting with the first:
\[ w_1 \leftarrow 0.1 + 0.1 \times (0 - 1) \times 0 = 0.1 + 0 = 0.1 \]
Adding the learning adjustment of 0 to our weight of 0.1 gives us, well... 0.1. Not particularly exciting. Repeating the process for the second weight (using the same sample) we get a rather similar result:
\[ w_2 \leftarrow 0.3 + 0.1 \times (0 - 1) \times 0 = 0.3 + 0 = 0.3 \]
Again, adding the adjustment of 0 to the weight of 0.3 gives us back the original weight. Our adjustments complete, we move on to the next sample: inputs [1, 0], label 0.
Again, you may be able to see we're not going to get the answer we're expecting, but let's work through it:
\[ (1 \times 0.1) + (0 \times 0.3) + 2 = 2.1 \ge 0 \]
Again our perceptron activates, and again it's incorrect, so again let's apply the learning rule:
\[ w_1 \leftarrow 0.1 + 0.1 \times (0 - 1) \times 1 = 0.1 - 0.1 = 0 \]
Now this is more interesting! We will update the weight of 0.1 by adding the delta of -0.1, making the new weight 0. Importantly, we aren't actually going to apply the change until all the other weights (well, the one other weight in our case) have been assessed:
\[ w_2 \leftarrow 0.3 + 0.1 \times (0 - 1) \times 0 = 0.3 \]
The same as before, so the second weight remains as it was. Before continuing on to the next sample we'll update the weights, giving us [0, 0.3].
On to the next sample: inputs [0, 1], label 0.
Let's assess the sample using our newly updated perceptron:
\[ (0 \times 0) + (1 \times 0.3) + 2 = 2.3 \ge 0 \]
Our perceptron is again activating, but we were expecting a non-activation, so let's apply the learning rule, starting with the first weight:
\[ w_1 \leftarrow 0 + 0.1 \times (0 - 1) \times 0 = 0 \]
Doesn't look like the first weight will be changing this time. Let's take a look at the second weight:
\[ w_2 \leftarrow 0.3 + 0.1 \times (0 - 1) \times 1 = 0.3 - 0.1 = 0.2 \]
Now it looks like the second weight will be updated. With both weights assessed, let's update them, giving us [0, 0.2].
Cool! You might be able to see a pattern emerging here - it seems like a weight is adjusted when: 1) the output is incorrect, and 2) the weight's input is greater than 0 (in fact, any non-zero input will satisfy this second condition). The adjustment is scaled by both the difference between the expected and actual outputs, and the learning rate. Before we move on to some final notes, we have one more sample to look at: inputs [1, 1], label 1.
This scenario's label is different from the others in that it is actually looking for an activation. Let's run it through the perceptron:
\[ (1 \times 0) + (1 \times 0.2) + 2 = 2.2 \ge 0 \]
The perceptron activates again, but with this sample we were actually looking for an activation, so we won't be updating the weights. It wouldn't matter if we applied the learning rule here anyway, as part of the scaling looks at the difference between the expected and actual outputs, which will be 0 when the two values match:
\[ w_1 \leftarrow 0 + 0.1 \times (1 - 1) \times 1 = 0 + 0 = 0 \]
So no changes need to be made to the weights here. Next, the whole process repeats, and continues to repeat until convergence is reached - which is guaranteed for any linearly separable dataset, as long as an appropriate learning rate is selected (a high learning rate will mean things move quicker, but could also mean the decision boundary repeatedly 'overshoots' its correct position).
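Putting it all together, here's a minimal training loop in Python. One caveat: unlike the walkthrough above, this sketch also nudges the bias using the same rule, treating it as a weight whose input is always 1 - a common variant, and without it our over-large starting bias of 2 could never be corrected:

```python
def perceptron(inputs, weights, bias):
    total = sum(i * w for i, w in zip(inputs, weights)) + bias
    return 1 if total >= 0 else 0

def train(samples, weights, bias, learning_rate=0.1, max_epochs=100):
    for epoch in range(max_epochs):
        errors = 0
        for inputs, target in samples:
            output = perceptron(inputs, weights, bias)
            if output == target:
                continue  # correct - move straight on to the next sample
            errors += 1
            error = target - output
            # Work out every adjustment from the *old* weights, then
            # apply them together, just as in the walkthrough above.
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error  # the bias as a weight with input 1
        if errors == 0:
            return weights, bias, epoch  # converged: a full error-free pass
    return weights, bias, max_epochs  # gave up

# The labelled dataset from earlier.
samples = [([0, 0], 0), ([1, 0], 0), ([0, 1], 0), ([1, 1], 1)]
print(train(samples, [0.1, 0.3], 2))
```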
The XOR Problem and Why We Need a Bias Value
If you're still with me now, you're doing fantastically! I just have two more things to mention before signing off: this notion of linear separability, and why we want to have a bias value. Firstly, if you're anything like me, you would have seen the bias value and wondered why it is there at all - it seems to do nothing in the best case, and in our example it actually slowed down the rate of learning. True, it would have been better to set the bias to a smaller value, but the bias definitely has a purpose - it moves the decision boundary away from the origin. Without it, the weighted sum for an input vector filled with zeros is 0 no matter what the weights are, so no amount of learning can ever change how that input is classified.
Earlier we described our scenario as looking for cases where both the pen lid and pen body are present. What if we decided that empty boxes are also a positive scenario? Our dataset would become:

Inputs    Label
[0, 0]    1
[1, 0]    0
[0, 1]    0
[1, 1]    1
Now, where would you draw a decision boundary line to separate the outputs? There's no linear way to do it! There is, however, another solution:
We draw two decision boundaries, train a perceptron to each, then train a third perceptron on the outputs of the other two, as sketched below.
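To make this concrete, here's a minimal hand-wired version in Python - the first perceptron detects 'both items present', the second detects 'both items absent', and the third ORs their outputs together (these particular weights are just one workable choice I've picked by hand, not the only one):

```python
def perceptron(inputs, weights, bias):
    total = sum(i * w for i, w in zip(inputs, weights)) + bias
    return 1 if total >= 0 else 0

def network(inputs):
    # First decision boundary: activates only when BOTH items are present.
    both_present = perceptron(inputs, [0.5, 0.5], -1)
    # Second decision boundary: activates only when BOTH items are absent.
    both_absent = perceptron(inputs, [-0.5, -0.5], 0)
    # Third perceptron: an OR over the outputs of the first two.
    return perceptron([both_present, both_absent], [0.5, 0.5], -0.5)

for box in ([0, 0], [1, 0], [0, 1], [1, 1]):
    print(box, network(box))  # prints 1 for [0,0] and [1,1], 0 otherwise
```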
This arrangement is known as a perceptron network - otherwise known as a neural network! As more layers are added, more complex scenarios can be mapped, with an arbitrary number of inputs and outputs (going beyond our simple binary true/false output to multiple different possible output options, each with its own output perceptron). In fact, multiple layers in between the leftmost layer of perceptrons and the rightmost form the basis of deep learning - the construction and training of which is a bit beyond the scope of this blog, but you now have the basic building blocks of one of the coolest algorithms in computer science!