Boosting
In the last article we introduced an ensemble algorithm based on the bootstrap technique. In this article, we will introduce another popular ensemble algorithm called Boosting. Like Bagging, Boosting creates a strong classifier from a number of weak classifiers. The difference is that Boosting builds a weak classifier from the training data, then creates a second classifier that attempts to correct the errors of the first one. To do this, it iteratively adjusts the weights of the data points according to the performance of each weak classifier; these weights either affect the probability that a data point is selected for the next training set or are used directly in training (depending on whether the weak classifier's algorithm can use sample weights directly). The weak classifiers are also given different weights based on the accuracy of their predictions, and those weights are taken into account when building the strong classifier. In addition, the weak classifiers in Bagging are independent of each other and can therefore be trained in parallel, whereas the weak classifiers in Boosting have to be trained sequentially, since the weights of the data points depend on the predictions of the previous classifiers. Weak classifiers are added until the training set is predicted perfectly or a maximum number of classifiers has been reached.
Adaptive Boosting, better known as AdaBoost, was the first really successful boosting algorithm developed for binary classification. It can be used to boost the performance of any machine learning algorithm, but it is normally used with weak classifiers. A weak classifier here is an algorithm that performs only slightly better than random guessing (accuracy just above 50%), such as one that makes a prediction based on a single attribute (e.g. if height > 170cm, predict male, otherwise female). The decision stump is a typical weak classifier: compared with a decision tree, it has only one level and uses a single input attribute to make a binary classification.
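To make the idea concrete, here is a minimal sketch of a decision stump in Python (the function name, parameters, and NumPy dependency are just assumptions for illustration, not part of any particular library):

```python
import numpy as np

def stump_predict(X, feature, threshold, polarity=1):
    """Toy decision stump: threshold a single feature and output +1 / -1.

    `feature`, `threshold` and `polarity` are illustrative parameters;
    `polarity` lets the stump predict +1 on either side of the threshold.
    """
    return np.where(polarity * X[:, feature] > polarity * threshold, 1, -1)
```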
The following is a detailed walk-through of the AdaBoost algorithm:
1. At the beginning, the weights of all data points are the same: weight(Xi) = 1 / n, where Xi is the i-th data point in the training set and n is the number of data points.
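In code, this initialization is a one-liner (assuming NumPy and an example value of n):

```python
import numpy as np

n = 100                        # number of training points (example value)
weights = np.full(n, 1.0 / n)  # weight(Xi) = 1/n for every data point
```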
2. Next, we train a new weak binary classifier (a decision stump), which predicts based on one input attribute and outputs 1 or -1 as the binary classification result. We then compare the predictions with the actual values and calculate the error of the weak classifier as error = sum(Werror) / sum(W), that is, the summed weights of the misclassified data points divided by the summed weights of all data points. The error is therefore a decimal proportional to the total weight of the misclassified data points. For example, if the prediction is (1, -1, 1) and the actual values are (1, 1, 1), and the weights of the three data points are all equal to 1/3, the error is (1/3) / (1/3 + 1/3 + 1/3) = 1/3. If the weights are 0.1, 0.7, 0.2 instead, the error is 0.7 / (0.1 + 0.7 + 0.2) = 0.7.
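A minimal sketch of the weighted-error calculation, reproducing the second worked example above:

```python
import numpy as np

prediction = np.array([1, -1, 1])
actual     = np.array([1,  1, 1])
weights    = np.array([0.1, 0.7, 0.2])

misclassified = prediction != actual                  # True where the stump is wrong
error = weights[misclassified].sum() / weights.sum()  # weighted error
print(error)  # 0.7
```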
3. Now we can calculate the weight of this weak classifier with the function Wc = 0.5 * ln((1 - error) / error), where error is the value obtained in the previous step. Below is the graph of this function (a short code sketch of the formula follows the two observations below):
From the graph above, we can see that the weight function of the weak classifier has two effects:
a) It maps the error, a decimal in the range [0, 1], to a number in the range (-∞, +∞), so that this number can be used to increase or decrease the weights of the weak classifiers and of the individual data points.
b) It uses error = 50% as a cut-off point. If error < 50%, the result of the function (the weight of the weak classifier) is positive, and the smaller the error, the larger the classifier's weight and the more important its prediction is within the Boosting model. If error > 50% (unusual, because the weak classifier's accuracy is supposed to be better than 50%), the weight is negative, which is still meaningful: it flips the classifier's prediction, turning this "bad" classifier into a good one with error < 50%. When error = 50%, the weight is 0, because this kind of weak classifier does not help us at all; we cannot extract any information from its predictions.
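The weight formula itself is a one-liner; the sample error values below are only for illustration:

```python
import numpy as np

def classifier_weight(error):
    """Wc = 0.5 * ln((1 - error) / error)."""
    return 0.5 * np.log((1 - error) / error)

print(classifier_weight(0.1))  # small error          -> large positive weight (~1.10)
print(classifier_weight(0.5))  # coin-flip classifier -> weight 0
print(classifier_weight(0.7))  # worse than random    -> negative weight (~-0.42)
```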
4. Then we update the weights of the individual data points using the weight of the classifier. The principle is that misclassified data points get larger weights, while the weights of correctly classified points are reduced. The function is w' = w * exp(Wc * t), where w is the original weight of the data point, w' is the updated weight, exp() is the exponential function of the mathematical constant e, Wc is the weight of the classifier, and t is a flag determined by the prediction for the data point: t = 1 when the data point is misclassified, and -1 otherwise.
For misclassified data points, since e ≈ 2.7, when Wc > 0 we have w' > w, which means the weights of the misclassified data points are increased, while the weights of the correctly classified data points are reduced.
When Wc < 0, the classifier's prediction is flipped, so the "misclassified" data points are actually classified correctly. As compensation, their weights are reduced by multiplying w by exp(a negative number), and the weights of the remaining data points are increased.
Furthermore, we normally renormalize the weights of all data points so that they sum to 1.
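Putting the update and the renormalization together, a minimal sketch (reusing the NumPy-style arrays from the earlier snippets) might look like this:

```python
import numpy as np

def update_weights(weights, Wc, misclassified):
    """w' = w * exp(Wc * t), with t = +1 for misclassified points and -1 otherwise."""
    t = np.where(misclassified, 1.0, -1.0)
    new_weights = weights * np.exp(Wc * t)
    return new_weights / new_weights.sum()  # renormalize so the weights sum to 1
```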
5. Unlike Bagging, the combined prediction (the strong classifier) is not made by majority voting. It is the weighted sum of the predictions of all the weak classifiers in the model: if the sum is greater than 0, the combined prediction is 1, otherwise -1.
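As a sketch, if `stumps` is assumed to be a list of (classifier weight, prediction function) pairs collected over the iterations, the combined prediction could be written as:

```python
import numpy as np

def strong_classify(stumps, X):
    """Weighted sum of the weak classifiers' predictions, then take the sign."""
    total = sum(Wc * predict(X) for Wc, predict in stumps)
    return np.where(total > 0, 1, -1)
```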
6. Repeat steps 2–5 above until:
- The accuracy of the strong classifier reaches the expected value, or
- A maximum number of weak classifiers has been added.
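In practice we rarely hand-roll this loop; for example, scikit-learn's AdaBoostClassifier implements the same scheme with decision stumps (one-level trees) as the default weak classifiers. The toy dataset below is generated only for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = AdaBoostClassifier(n_estimators=50, random_state=0)  # up to 50 decision stumps
model.fit(X, y)
print(model.score(X, y))  # training accuracy
```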
Like all ensemble algorithms, Boosting has the advantage of being less prone to overfitting. However, since Boosting keeps increasing the weights of misclassified data points, outliers and noise in the data tend to have a negative impact on the model. Thus, when preparing the training data, we should try to remove such points and improve the quality of the data. In addition, since Boosting is an iterative algorithm, its weak classifiers cannot be trained in parallel, so training speed can also be a weak point.