Linear Classifiers - Perceptrons

I'll try to hand-hold you gently into the world of classification, using one of the simplest approaches, yet an effective one when applied to the right use cases. In case you are not aware of what "classification" means, here is the Wikipedia definition to get you started:

Classification is the problem of identifying to which of a set of categories a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. An example would be assigning a given email to the "spam" or "non-spam" class.

Coming to the term "perceptron": humans have always admired the ability of their brains to achieve the complex tasks they perform, and have for some years now tried to replicate it by representing those actions as logical mathematical functions. One of the earliest such instances was the 1943 paper by Warren McCulloch and Walter Pitts, "A Logical Calculus of the Ideas Immanent in Nervous Activity" (read more for the history of perceptrons). We have come a long way since then, though that does not necessarily mean we have moved very close to the objective of making machines human-like.

To explain the underlying idea of a perceptron:

Can we make a simple classifier that assigns classes based on whether a mathematical representation returns values higher or lower than a threshold?

Let's break the above statement down one piece at a time. Do not worry if it doesn't immediately make sense; we will address each piece in greater detail below.

  • Simple classifier: The simplest form of classifier is the linear classifier, i.e. the algorithm tries to cut through the space of the distribution of the positive and negative examples using a straight line, in its best attempt at a good classification. For this article we will stick to a linear classifier
  • Classes: Classes refer to the categories samples can be segmented into, an example being different families of flowers, like setosa, versicolor or virginica. Classification can be binary or multi-class. For this exercise we will choose a multi-class problem (3 classes to be precise) and for simplicity solve it using the One Vs All method
  • One Vs All method: In each iteration of classification we assume we have just two classes, i.e. we want to separate the setosa class from the rest (versicolor and virginica), and similarly separate versicolor from the rest (setosa and virginica); see the sketch after this list
  • Mathematical Representation: This is where we define the boundary and try to tweak it to achieve a good classification by adjusting weights; more on this later in the article
  • Threshold: A constant such that if the boundary function returns a value beyond it we classify the example as the positive class, else the negative class
  • Good Classification: To understand whether our model is doing well we need to define a cost function, which compares the actual label against what the model predicts, and then make tweaks to our boundary function
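
To make the One Vs All relabelling concrete, here is a minimal sketch; the function name and the toy label vector are mine, for illustration only:

```python
import numpy as np

# Toy label vector for illustration; the real one has 150 entries.
y = np.array(['setosa', 'versicolor', 'virginica', 'setosa'])

def one_vs_all(y, positive_class):
    """Relabel a multi-class vector as +1 (the chosen class) vs -1 (the rest)."""
    return np.where(y == positive_class, 1, -1)

print(one_vs_all(y, 'setosa'))      # [ 1 -1 -1  1]
print(one_vs_all(y, 'versicolor'))  # [-1  1 -1 -1]
```

Running the classifier once per class with these +1/-1 labels gives us the three One Vs All splits.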

Making Of The Classifier

For this article we will use the famous and easily available Iris dataset and try to classify the flowers into the three labels: 'setosa', 'versicolor' and 'virginica'.

The dataset has 150 samples (50 per class) as rows and 4 features as columns, measuring sepal width, sepal length, petal width and petal length respectively.

This feature set x is represented in matrix format (150 x 4), where each row holds the feature set of a particular flower; the subscript denotes the feature and the superscript the observation number, e.g. x_1^(49) is the value of the sepal width of the 49th flower. Along with this we also have a matrix y (150 x 1) which holds the label for each example, i.e. whether it is a setosa, versicolor or virginica.
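
If you want to follow along, a quick way to get x and y in exactly these shapes is scikit-learn's bundled copy of the dataset; this is an assumption of mine, the original code may load the data differently:

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data                       # feature matrix, shape (150, 4)
y = iris.target_names[iris.target]  # one label string per flower, shape (150,)

print(X.shape, y.shape)  # (150, 4) (150,)
print(y[:3])             # ['setosa' 'setosa' 'setosa']
# Note: sklearn's column order is sepal length, sepal width, petal length,
# petal width, so check the column order of whatever file you actually use.
```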

Let us look at some plots of the flowers against each pair of features and get an idea of how well our linear separator might do.

The first impression we get is that in every feature combination setosa stands out, and it should not be a problem putting a boundary across it, classifying setosa vs the others. At the very same time it looks difficult for a linear classifier to completely separate versicolor and virginica. Having said that, let's see how close we can get.

Next we define a set of 4 random weights, one per feature. A matrix multiplication between the feature matrix (150 x 4) and the weights (4 x 1) gives us a (150 x 1) vector of decision values z, i.e. for every observation we have quantified a value that will help us make the classification decision.
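
A minimal sketch of that multiplication, assuming small random starting weights (the seed and scale are arbitrary choices of mine):

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data                # (150, 4) feature matrix
rng = np.random.default_rng(0)
w = rng.normal(scale=0.01, size=4)  # 4 random starting weights, one per feature

z = X @ w                           # (150,) one decision value per observation
print(z.shape)                      # (150,)
```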

So we define a decision function called a step function: assign an observation the positive class whenever the value of z crosses a threshold value theta, and the negative class otherwise. We can make a small modification to our decision function by bringing theta over to the side of z: assign the positive class whenever z - theta is greater than or equal to zero, and the negative class otherwise. We thus rewrite z as:

z = w0*x0 + w1*x1 + w2*x2 + w3*x3 + w4*x4

where w0 = -theta is a new parameter, multiplied with x0, which is always 1, and is called the bias. To accommodate the bias term we add to the matrix x a new column of all ones, and a w0 (our -theta) to our weight matrix. Next we need to measure how well our decision function did against the actual label and tweak the random starting weights in case of misclassifications.
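
In code, the bias column and the step function might look like this; the helper name is mine, and comparing against zero is exactly the z - theta >= 0 trick described above:

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data
Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend x0 = 1 -> shape (150, 5)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.01, size=5)             # w[0] plays the role of -theta

def step(Xb, w):
    """Step decision function: +1 where z = Xb @ w >= 0, else -1."""
    return np.where(Xb @ w >= 0.0, 1, -1)

print(step(Xb, w)[:5])  # predictions for the first 5 flowers
```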

We call this the error function, which is the squared difference between the actual label and the model output, summed over all examples. So if the actual class is -1 and the model output is -1 then we have zero error, while if the model output is +1 then the error is (-1 - 1) squared, i.e. 4. We want to minimize the error function; zero error means every example is classified correctly.
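
The worked example above, in code (the two label vectors are made up for illustration):

```python
import numpy as np

y_true = np.array([-1, -1,  1,  1])  # hypothetical actual labels
y_pred = np.array([-1,  1,  1, -1])  # hypothetical model outputs

errors = (y_true - y_pred) ** 2      # 0 for a correct call, 4 for a miss
print(errors)                        # [0 4 0 4]
print(errors.sum())                  # overall error for this pass: 8
```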

From the error function we learn that we need to tweak our weights to get the classification correct. Logically, if we have misclassified a -1 as a +1 it means our weights are too high and need to be reduced to get to -1, while if we misclassify a +1 as a -1 we have to increase our weights to get to +1. The weight update rule is as follows:

wi := wi + delta_wi, where delta_wi = eta * (y - y_hat) * xi

with y the actual label, y_hat the model output and xi the i-th feature value of the example.

To understand the structure of the equation let's again consider some use cases. A correct classification gives y - y_hat = 0, leading to no change in weights. For a misclassification of -1 as +1, we would want to reduce the weights to get a correct classification; from the representation, y - y_hat = -2, so the delta term is -2*eta*xi, and given xi is positive it pushes wi down. Similarly, a misclassification of +1 as -1 requires increasing the weights; in this case the delta term is +2*eta*xi. The eta in the update is called the learning rate, a value in (0.0, 1.0); we take only a fraction of the delta to make sure we do not overshoot our optimal targets.
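
Here is the update rule as a small sketch, applied to all the weights of one example at once; the names are hypothetical, not taken from the article's code:

```python
import numpy as np

def update_weights(w, xb, y, y_hat, eta=0.1):
    """wi := wi + eta * (y - y_hat) * xi, applied to every weight at once.

    (y - y_hat) is 0 for a correct classification (no change),
    -2 when a -1 was misclassified as +1 (weights pushed down),
    +2 when a +1 was misclassified as -1 (weights pushed up).
    """
    return w + eta * (y - y_hat) * xb

w = np.array([0.0, 0.5, -0.2, 0.1, 0.3])     # hypothetical weights (bias first)
xb = np.array([1.0, 5.1, 3.5, 1.4, 0.2])     # one example with x0 = 1 prepended
print(update_weights(w, xb, y=-1, y_hat=1))  # -1 misclassified as +1: weights drop
```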

Ending Notes

The way I have implemented the solution, the weights are updated once all the examples have been run through, so as to minimize the overall error. The examples are read over and over again in passes called epochs: in each epoch the overall error is calculated and all weights are adjusted, and another epoch is run, until we can't reduce the error any further.
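
Putting the pieces together, here is a compact sketch of that epoch loop for the setosa vs rest split. The structure follows the description above, but the names, seed and stopping rule are mine and not necessarily what the linked code does:

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X = np.hstack([np.ones((150, 1)), iris.data])                    # bias column first
y = np.where(iris.target_names[iris.target] == 'setosa', 1, -1)  # One Vs All labels

def train(X, y, eta=0.01, epochs=100):
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=X.shape[1])
    for _ in range(epochs):
        y_hat = np.where(X @ w >= 0.0, 1, -1)  # step-function predictions
        if np.all(y_hat == y):                 # zero error: nothing left to reduce
            break
        w += eta * (y - y_hat) @ X             # batch update over all examples
    return w

w = train(X, y)
accuracy = (np.where(X @ w >= 0.0, 1, -1) == y).mean()
print(accuracy)  # setosa is linearly separable, so this should reach 1.0
```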

That essentially is the crux of a perceptron architecture. I have implemented a quick Python script for this exercise. You can find it here: [ GitHub link to Iris Perceptron]
