A behavioral comparison of binary classifiers
The past seven decades of research in Computer Science and Engineering have produced a vast body of knowledge on classification, with algorithms that use either online or batch learning techniques. For example, many variants of the Perceptron [1] have been invented ([2], [3], [4], [5]), and other, more recent techniques have been introduced, such as k-Nearest Neighbors, Support Vector Machines, and Random Forests.
With such a wide range of options, selecting the most adequate method can be overwhelming. Serious practitioners therefore invest a significant amount of time in learning the specifics of each method: its complexity, performance, limitations, and even the assumptions it makes about the data. In the absence of a truly comprehensive comparative study, practical experience is necessary to acquire the needed skills.
Out of curiosity, I decided to test a few algorithms on my own, and compare their behaviors under different conditions.
The algorithms
For this experiment, I chose four classification algorithms:
- Perceptron
- Averaged Perceptron
- Adapted Pocket algorithm
- Pegasos
Perceptron
In its most basic form, this algorithm [1] is well known, which makes it a great baseline against which to benchmark the other algorithms. It also exhibits the following properties:
- It converges to a solution in the linearly separable case, as demonstrated by Novikoff in 1962 [6]
- In the non-separable case, the number of mistakes it makes has an upper bound, as demonstrated by Mohri and Rostamizadeh in 2013 [7]
- It typically follows an erratic (and therefore non-optimal) convergence trajectory: it often diverges from the final solution multiple times before eventually re-converging.
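As a concrete reference point, here is a minimal sketch of the classic Perceptron update rule. The function name and the convention of folding the bias into the weight vector are my own assumptions, not taken from the experiment's actual code:

```python
import numpy as np

def perceptron(X, y, epochs=1):
    """Classic Perceptron: update the weights only on misclassified points.

    X: (n, d) feature matrix; y: labels in {-1, +1}.
    Returns the final weight vector, with the bias folded in
    as the last component.
    """
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append constant 1 for the bias
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(Xb, y):
            if y_i * np.dot(w, x_i) <= 0:  # mistake: prediction disagrees with label
                w += y_i * x_i             # move the boundary toward the point
    return w
```

Each mistake nudges the separating hyperplane toward the misclassified point, which is exactly what produces the erratic trajectory mentioned above.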
Averaged Perceptron
This algorithm returns the average of the weight vectors produced by the Perceptron. In my implementation, the average is computed over all the weights indiscriminately, not only over the distinct ones (as in some other implementations).
The main expectation for this algorithm is that it offers a more "stable" behavior than the Perceptron by attenuating the rapid, unpredictable peaks and deviations from the final solution. It also inherits the convergence properties of the Perceptron.
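A minimal sketch of this averaging, under the same conventions (my own naming; the bias is folded into the weight vector as its last component). The running sum is accumulated after every step, whether or not an update occurred, which matches averaging over all weights indiscriminately:

```python
import numpy as np

def averaged_perceptron(X, y, epochs=1):
    """Averaged Perceptron: standard Perceptron updates, but the returned
    weights are the mean of the weight vector over every processing step."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    w = np.zeros(Xb.shape[1])
    w_sum = np.zeros_like(w)
    count = 0
    for _ in range(epochs):
        for x_i, y_i in zip(Xb, y):
            if y_i * np.dot(w, x_i) <= 0:
                w += y_i * x_i
            w_sum += w   # accumulate after every step, not only after updates
            count += 1
    return w_sum / count
```

Because short-lived weight vectors contribute only a few terms to the sum, the average damps the sharp swings of the plain Perceptron.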
Adapted Pocket algorithm
This algorithm is based on the Pocket Algorithm with Ratchet proposed by Gallant in 1990 [5]. In my adaptation, the so-called "Pocket" keeps track of the highest performing weights, meaning the ones that yield the lowest training error.
With that particular logic, the "divergence" behavior of Perceptron is eliminated because the algorithm is "forced" to either maintain or reduce the gap towards the final solution.
Note that this algorithm needs to leverage the entire training data upfront in order to determine the worth of each new set of weights. Therefore, unlike the first two algorithms, it is not an "online" learning method. However, it serves as a great reference point to gauge the deviations of the other algorithms from an optimal convergence trajectory.
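The pocket logic can be sketched as follows (my own naming and conventions, not necessarily the exact implementation). After each Perceptron update, the training error over the whole dataset is evaluated, and the best weights seen so far are kept in the "pocket":

```python
import numpy as np

def pocket(X, y, epochs=1):
    """Adapted Pocket: run Perceptron updates, but keep in the "pocket"
    the weights that yield the lowest training error over the full dataset."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    w = np.zeros(Xb.shape[1])
    best_w = w.copy()
    best_err = np.mean(np.sign(Xb @ w) != y)
    for _ in range(epochs):
        for x_i, y_i in zip(Xb, y):
            if y_i * np.dot(w, x_i) <= 0:
                w += y_i * x_i
                err = np.mean(np.sign(Xb @ w) != y)  # needs the whole dataset
                if err < best_err:                   # ratchet: only improve
                    best_w, best_err = w.copy(), err
    return best_w, best_err
```

The full-dataset error evaluation inside the loop is precisely why this is not an online method: each candidate weight vector must be scored against all the training data.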
Pegasos
This algorithm is based on the work of Shalev-Shwartz, Singer and Srebro published in 2007 [8]. In a nutshell, it uses Stochastic Gradient Descent (SGD) to update the weights at each step but, unlike more traditional methods, it introduces a carefully designed decaying learning rate.
In my implementation, the loss function is the simple hinge loss, and the learning rate is the inverse of the square root of the iteration index, so the learning rate decreases as the number of iterations increases.
Finally, for simplicity, the regularization parameter is set to zero: we aim only at observing convergence, not at generalizing the results.
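Under these simplifications, the update rule can be sketched as follows (my own naming; with the regularizer removed, this reduces to plain SGD on the hinge loss with a decaying learning rate):

```python
import numpy as np

def pegasos_no_reg(X, y, epochs=1):
    """Pegasos-style SGD on the hinge loss with regularization set to zero,
    as in this experiment: learning rate eta_t = 1 / sqrt(t)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    w = np.zeros(Xb.shape[1])
    t = 0
    for _ in range(epochs):
        for x_i, y_i in zip(Xb, y):
            t += 1
            eta = 1.0 / np.sqrt(t)          # decaying learning rate
            if y_i * np.dot(w, x_i) < 1:    # inside the margin: hinge loss active
                w += eta * y_i * x_i        # subgradient step on the hinge loss
    return w
```

Note the difference from the Perceptron: the update fires whenever a point is inside the margin (score below 1), not only when it is misclassified, and the step size shrinks over time.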
This algorithm was chosen because it is representative of an entirely different family of classification algorithms, Support Vector Machines, and it is interesting to compare it with the Perceptron-based algorithms.
The testing framework
Now that the algorithms have been introduced, let's clarify how they are compared with each other. The answer is simple: with the training error.
Indeed, after processing each data point, the algorithms produce outputs (weights) which are then used to calculate the training errors against the entire dataset.
At each step, these training errors are recorded in order to observe the convergence of our algorithms.
In this context, "convergence" means that the value remains within a certain neighborhood of a final value, after a certain number of iterations. In the best case scenario, the final value is zero, which would mean that the algorithm found a solution to the classification problem.
If an algorithm is capable of reaching zero (or minimal) training error within the fewest number of iterations, we consider that algorithm to be the most efficient.
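The measurement itself can be sketched as a simple helper (names and the bias-folding convention are my own assumptions):

```python
import numpy as np

def training_error(w, X, y):
    """Fraction of the dataset misclassified by weight vector w.

    The bias is assumed to be folded into w as its last component,
    and y contains labels in {-1, +1}.
    """
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.mean(np.sign(Xb @ w) != y)
```

In the experiment, this quantity is recorded after every processed data point, producing one error curve per algorithm over the epoch.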
To visualize the evolution of the training errors, we plot them against the number of iterations (training error on the y-axis, number of iterations on the x-axis).
Below is an example of such a graph, where each color represents a different algorithm.
Without going into specifics at this stage, we can clearly see in the picture below that the green curve drops below the blue curve. We may therefore conclude that the "green" algorithm performs better than the "blue" one.
The experiment
In this experiment, I run the algorithms against three datasets representing three distinct data conditions.
These datasets are artificially generated with random number generators which adhere to specific probability laws. By doing so, we gain a higher degree of control over the experiment's conditions.
For simplicity and representation purposes, the data is two-dimensional and only two labels/categories are used (therefore we are limiting our tests to binary classification).
Two bivariate normal distributions, centered at the Cartesian coordinates (0,0) and (4,3), are used to produce the clusters of points, each cluster representing a label/category. Separability is controlled by adjusting the variance of the distributions.
A total of 1000 data points are generated (500 of each category) for each dataset, with distribution variances respectively equal to 0.5, 2.0 and 4.0.
After the generation, the data points are randomly shuffled (re-sequenced) before getting processed by the algorithms.
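The generation procedure can be sketched as follows (the function name, seed handling, and the choice of labels in {-1, +1} are my own assumptions):

```python
import numpy as np

def make_dataset(variance, n_per_class=500, seed=0):
    """Two bivariate normal clusters centered at (0,0) and (4,3).

    Separability is controlled through the shared variance; the points
    are shuffled before being streamed to the algorithms.
    """
    rng = np.random.default_rng(seed)
    cov = variance * np.eye(2)  # isotropic covariance for both clusters
    X_neg = rng.multivariate_normal([0, 0], cov, n_per_class)
    X_pos = rng.multivariate_normal([4, 3], cov, n_per_class)
    X = np.vstack([X_neg, X_pos])
    y = np.concatenate([-np.ones(n_per_class), np.ones(n_per_class)])
    perm = rng.permutation(len(y))  # random re-sequencing of the 1000 points
    return X[perm], y[perm]
```

Calling this with variances 0.5, 2.0, and 4.0 yields the three scenarios, from cleanly separable to heavily intermixed clusters.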
First scenario
The first scenario represents the linearly separable case, where we know in advance that the algorithms converge. However we wish to observe how efficiently this convergence is achieved.
Second scenario
The second scenario represents a mildly non-separable case, where we know that a solution with minimal errors exists. The question is: how would such conditions impact the convergence behaviors?
Third scenario
Finally, the third scenario represents the highest degree of non-separability where a large portion of the data points are intermixed. The intent is to stress test the algorithms and determine which one is more resilient or efficient in such conditions.
Results
Scenario 1
Before commenting on the result, let me first describe the layout:
- As you can see, the x-axis goes from 0 to 1000, which corresponds to the data points as sequenced in the dataset. We limit our observation to a single epoch.
- The graph plots the training error immediately after each data point is processed.
- The red graph represents Perceptron, the blue represents Averaged Perceptron, the yellow represents the adapted Pocket algorithm and finally the green one represents Pegasos.
The first observation concerns the red graph (Perceptron). As you can see, there are many sharp peaks, and with such a behavior it is difficult to determine whether the algorithm is converging or not. We know theoretically that it converges, but achieving this convergence can require re-processing the dataset multiple times (running multiple epochs).
In this instance, after 1000 iterations, the Perceptron produces the lowest performance, with a training error above 0.05 (more than 5% of the data points would be misclassified).
On the other hand, the blue graph (Averaged Perceptron) behaves as expected by smoothing the peaks of the Perceptron. Its result after 1000 iterations is better (around 0.02).
Finally, both the yellow (Pocket) and green (Pegasos) graphs reach the expected zero training error in fewer than 500 iterations. Both performed better than the first two.
We must remember, however, that Pegasos uses a fully online learning method, unlike the Pocket algorithm.
Scenario 2
Again, the Perceptron (red) algorithm gives us the lowest performance with a training error above 0.06 after 1000 iterations.
On the other hand, the Averaged Perceptron (blue) drops its training error progressively as the number of iterations increases, all the way down to around 0.045 after 1000 iterations. It still performs better than the Perceptron.
The Pocket algorithm (yellow) again outperforms the first two, reaching training errors right below 0.04 after 1000 iterations.
Finally, Pegasos exhibits a different behavior this time, with peaks: the curve is not as smooth as it was in Scenario 1. However, two observations are worth mentioning:
- It achieves the lowest training error amongst all algorithms within the first 100 iterations.
- It outperforms the Pocket algorithm (yellow) over the majority of the graph.
Scenario 3
The results here are very clear: the Pegasos algorithm (green) performs better than the other three algorithms under these conditions.
We can still see the peaks (or instability); however, the algorithm reaches training errors roughly half as high (around 0.12) within about half as many iterations (~100). On average, Pegasos maintains a low training error throughout the entire epoch.
These results show distinctly superior performance.
Conclusion
Although the Perceptron and its variants are powerful methods, the experiment shows that Pegasos, which belongs to the Support Vector Machine family, yields superior performance within fewer computational cycles.
This is an example of the great contributions made by researchers over the past few decades.
Note that the Perceptron was initially conceived to be a model for a biological neuron, and its power truly lies in connecting it to multiple other Perceptrons. By doing so we create what is referred to as multilayered perceptrons or Artificial Neural Networks.
The code for this experiment is posted at this location on GitHub. It will enable you to re-generate the data, re-run the algorithms, and plot the results. Please share your comments, conclusions, or feedback.
References
[1] Rosenblatt, Frank (1957). "The Perceptron—a perceiving and recognizing automaton". Report 85-460-1. Cornell Aeronautical Laboratory
[2] Anlauf, J. K.; Biehl, M. (1989). "The AdaTron: an Adaptive Perceptron algorithm". Europhysics Letters. 10 (7): 687–692.
[3] Freund, Y.; Schapire, R. E. (1999). "Large margin classification using the perceptron algorithm". Machine Learning. 37 (3): 277–296.
[4] Dekel, Ofer; Shalev-Shwartz, Shai; Singer, Yoram (2008). "The forgetron: A kernel-based perceptron on a budget" . SIAM Journal on Computing. 37 (5): 1342–1372.
[5] Gallant, S. I. (1990). Perceptron-based learning algorithms. IEEE Transactions on Neural Networks, vol. 1, no. 2, pp. 179–191
[6] Novikoff, A. B. (1962). On convergence proofs on perceptrons. Symposium on the Mathematical Theory of Automata, 12, 615–622. Polytechnic Institute of Brooklyn.
[7] Mohri, Mehryar; Rostamizadeh, Afshin (2013). "Perceptron Mistake Bounds". arXiv:1305.0208.
[8] Shalev-Shwartz, S., Singer, Y., Srebro, N.: Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. In: Proceedings of the 24th International Conference on Machine Learning, pp. 807–814 (2007)