Linear Classifiers - Perceptrons

I'll try to hand-hold you gently into the world of classification, using one of the simplest approaches, yet an effective one when applied to the right use cases. In case you are not aware of what "classification" means, here is the Wikipedia definition to get you started:

Classification is the problem of identifying to which of a set of categories a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. An example would be assigning a given email to the "spam" or "non-spam" class.

Coming to the term "perceptron": humans have always admired the ability of their brains to achieve the complex tasks they perform, and have for some years now tried to replicate it by representing those actions as logical mathematical functions. One of the earliest such instances was the 1943 paper by Warren McCulloch and Walter Pitts, "A Logical Calculus of the Ideas Immanent in Nervous Activity" (read more for the history of perceptrons). We have come a long way since then, though that does not necessarily mean we have moved very close to the objective of making machines human-like.

To explain the underlying idea of a perceptron:

Can we make a simple classifier that assigns classes based on whether a mathematical representation returns values higher or lower than a threshold?

Let's break the above statement down one piece at a time. Do not worry if it doesn't immediately make sense; we will address each piece in greater detail below.

  • Simple classifier: The simplest form of classifier is the linear classifier, i.e. the algorithm tries to cut through the space of the distribution of the positive and negative examples using a straight line, in its best attempt at a good classification. For this article we will stick to a linear classifier
  • Classes: Classes refer to the categories samples can be segmented into, an example being different families of flowers, like setosa, versicolor or virginica. Classification can be binary or multi-class. For this exercise we will choose a multi-class problem (3 classes to be precise) and for simplicity solve it using the One Vs All method
  • One Vs All method: In each iteration of classification we assume we have just two classes, i.e. we want to separate the setosa class from the rest (versicolor and virginica), and similarly separate versicolor from the rest (setosa and virginica); see the sketch after this list
  • Mathematical Representation: This is where we define the boundary and try to tweak it to achieve a good classification by adjusting weights; more on this later in the article
  • Threshold: A constant such that if the boundary function returns a value beyond it we classify the example as the positive class, else the negative class
  • Good Classification: To understand whether our model is doing well we need to define a cost function, which compares the actual label against what the model predicts, and then make tweaks to our boundary function
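
To make the One Vs All relabelling concrete, here is a minimal sketch; the function name and the toy label vector are mine, for illustration only:

```python
import numpy as np

# Toy label vector for illustration; the real one has 150 entries.
y = np.array(['setosa', 'versicolor', 'virginica', 'setosa'])

def one_vs_all(y, positive_class):
    """Relabel a multi-class vector as +1 (the chosen class) vs -1 (the rest)."""
    return np.where(y == positive_class, 1, -1)

print(one_vs_all(y, 'setosa'))      # [ 1 -1 -1  1]
print(one_vs_all(y, 'versicolor'))  # [-1  1 -1 -1]
```

Running the classifier once per class with these +1/-1 labels gives us the three One Vs All splits.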

Making Of The Classifier

For this article we will use the famous and easily available Iris dataset and try to classify the flowers into the three labels: 'setosa', 'versicolor' and 'virginica'.

The dataset has 150 samples (50 per class) as rows and 4 features as columns, measuring sepal width, sepal length, petal width and petal length respectively.

This feature set x is represented in matrix format (150 x 4), where each row holds the feature set of a particular flower; the subscript denotes the feature and the superscript the observation number, e.g. x_1^(49) is the value of the sepal width of the 49th flower. Along with this we also have a matrix y (150 x 1) which holds the label for each example, i.e. whether it is a setosa, versicolor or virginica.
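
If you want to follow along, a quick way to get x and y in exactly these shapes is scikit-learn's bundled copy of the dataset; this is an assumption of mine, the original code may load the data differently:

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data                       # feature matrix, shape (150, 4)
y = iris.target_names[iris.target]  # one label string per flower, shape (150,)

print(X.shape, y.shape)  # (150, 4) (150,)
print(y[:3])             # ['setosa' 'setosa' 'setosa']
# Note: sklearn's column order is sepal length, sepal width, petal length,
# petal width, so check the column order of whatever file you actually use.
```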

Let us look at some plots of the flowers against each pair of features and get an idea of how well our linear separator might do.

The first impression we get is that in every feature combination setosa stands out, and it should not be a problem putting a boundary across it, classifying setosa vs the others. At the very same time it looks difficult for a linear classifier to completely separate versicolor and virginica. Having said that, let's see how close we can get.

Next we define a set of 4 random weights, one per feature. A matrix multiplication between the feature matrix (150 x 4) and the weights (4 x 1) gives us a (150 x 1) vector of decision values z, i.e. for every observation we have quantified a value that will help us make the classification decision.
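
A minimal sketch of that multiplication, assuming small random starting weights (the seed and scale are arbitrary choices of mine):

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data                # (150, 4) feature matrix
rng = np.random.default_rng(0)
w = rng.normal(scale=0.01, size=4)  # 4 random starting weights, one per feature

z = X @ w                           # (150,) one decision value per observation
print(z.shape)                      # (150,)
```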

So we define a decision function called a step function: assign an observation the positive class whenever the value of z crosses a threshold value theta, and the negative class otherwise. We can make a small modification to our decision function by bringing theta over to the side of z: assign the positive class whenever z - theta is greater than or equal to zero, and the negative class otherwise. We thus rewrite z as:

z = w0*x0 + w1*x1 + w2*x2 + w3*x3 + w4*x4

where w0 = -theta is a new parameter, multiplied with x0, which is always 1, and is called the bias. To accommodate the bias term we add to the matrix x a new column of all ones, and a w0 (our -theta) to our weight matrix. Next we need to measure how well our decision function did against the actual label and tweak the random starting weights in case of misclassifications.
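
In code, the bias column and the step function might look like this; the helper name is mine, and comparing against zero is exactly the z - theta >= 0 trick described above:

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data
Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend x0 = 1 -> shape (150, 5)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.01, size=5)             # w[0] plays the role of -theta

def step(Xb, w):
    """Step decision function: +1 where z = Xb @ w >= 0, else -1."""
    return np.where(Xb @ w >= 0.0, 1, -1)

print(step(Xb, w)[:5])  # predictions for the first 5 flowers
```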

We call this the error function, which is the squared difference between the actual label and the model output, summed over all examples. So if the actual class is -1 and the model output is -1 then we have zero error, while if the model output is +1 then the error is (-1 - 1) squared, i.e. 4. We want to minimize the error function; zero error means every example is classified correctly.
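
The worked example above, in code (the two label vectors are made up for illustration):

```python
import numpy as np

y_true = np.array([-1, -1,  1,  1])  # hypothetical actual labels
y_pred = np.array([-1,  1,  1, -1])  # hypothetical model outputs

errors = (y_true - y_pred) ** 2      # 0 for a correct call, 4 for a miss
print(errors)                        # [0 4 0 4]
print(errors.sum())                  # overall error for this pass: 8
```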

From the error function we learn that we need to tweak our weights to get the classification correct. Logically, if we have misclassified a -1 as a +1 it means our weights are too high and need to be reduced to get to -1, while if we misclassify a +1 as a -1 we have to increase our weights to get to +1. The weight update rule is as follows:

wi := wi + delta_wi, where delta_wi = eta * (y - y_hat) * xi

with y the actual label, y_hat the model output and xi the i-th feature value of the example.

To understand the structure of the equation let's again consider some use cases. A correct classification gives y - y_hat = 0, leading to no change in weights. For a misclassification of -1 as +1, we would want to reduce the weights to get a correct classification; from the representation, y - y_hat = -2, so the delta term is -2*eta*xi, and given xi is positive it pushes wi down. Similarly, a misclassification of +1 as -1 requires increasing the weights; in this case the delta term is +2*eta*xi. The eta in the update is called the learning rate, a value in (0.0, 1.0); we take only a fraction of the delta to make sure we do not overshoot our optimal targets.
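
Here is the update rule as a small sketch, applied to all the weights of one example at once; the names are hypothetical, not taken from the article's code:

```python
import numpy as np

def update_weights(w, xb, y, y_hat, eta=0.1):
    """wi := wi + eta * (y - y_hat) * xi, applied to every weight at once.

    (y - y_hat) is 0 for a correct classification (no change),
    -2 when a -1 was misclassified as +1 (weights pushed down),
    +2 when a +1 was misclassified as -1 (weights pushed up).
    """
    return w + eta * (y - y_hat) * xb

w = np.array([0.0, 0.5, -0.2, 0.1, 0.3])     # hypothetical weights (bias first)
xb = np.array([1.0, 5.1, 3.5, 1.4, 0.2])     # one example with x0 = 1 prepended
print(update_weights(w, xb, y=-1, y_hat=1))  # -1 misclassified as +1: weights drop
```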

Ending Notes

The way I have implemented the solution, the weights are updated once all the examples have been run through, so as to minimize the overall error. The examples are read over and over again in passes called epochs: in each epoch the overall error is calculated and all weights are adjusted, and another epoch is run, until we can't reduce the error any further.
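
Putting the pieces together, here is a compact sketch of that epoch loop for the setosa vs rest split. The structure follows the description above, but the names, seed and stopping rule are mine and not necessarily what the linked code does:

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X = np.hstack([np.ones((150, 1)), iris.data])                    # bias column first
y = np.where(iris.target_names[iris.target] == 'setosa', 1, -1)  # One Vs All labels

def train(X, y, eta=0.01, epochs=100):
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=X.shape[1])
    for _ in range(epochs):
        y_hat = np.where(X @ w >= 0.0, 1, -1)  # step-function predictions
        if np.all(y_hat == y):                 # zero error: nothing left to reduce
            break
        w += eta * (y - y_hat) @ X             # batch update over all examples
    return w

w = train(X, y)
accuracy = (np.where(X @ w >= 0.0, 1, -1) == y).mean()
print(accuracy)  # setosa is linearly separable, so this should reach 1.0
```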

That essentially is the crux of a perceptron architecture. I have implemented a quick Python script for this exercise. You can find it here: [ GitHub link to Iris Perceptron]
