The Evolution of Boosting Algorithms

Decision Trees are a supervised learning method used in statistics, data mining, and machine learning, and they can be applied to both classification and regression. Decision Trees can be improved using boosting, as first described by Schapire in his paper "The Strength of Weak Learnability" [1]. Basically, a boosting algorithm is a learning algorithm that takes advantage of weak learners in order to generate high-accuracy hypotheses. Over the years the algorithm has been improved and adapted by various contributors. The fact that it underwent a series of mutations that led to algorithms like AdaBoost, Gradient Boosting, XGBoost, and LightGBM is proof that the main idea has passed "the test of time".

Table of Contents

· Decision Trees
· Boosting Algorithms
 ∘ AdaBoost
 ∘ Gradient Boosting
 ∘ XGBoost
 ∘ LightGBM
· Concluding Remarks
· Bibliography

The boosting algorithm has its roots in the distribution-free, or probably approximately correct (PAC), model of learning described by Valiant [2]. In this model the learner must find a prediction rule that correctly classifies an unknown concept as positive or negative, based only on randomly chosen examples of that concept. Valiant [2] hinted that weak and strong learnability should be treated as separate notions, but Schapire [1] proved that they are equivalent in the distribution-free model when the boosting algorithm is used. The algorithm as initially described takes advantage of a weak learning algorithm by running it several times on different distributions of instances, producing several different hypotheses. These hypotheses are then combined by the boosting algorithm into a larger hypothesis with increased accuracy. The distributions are generated through a process called "filtering": as the name suggests, it discards a randomly chosen part of the examples fed to the boosting algorithm and allows only a subset of them to enter the weak learning algorithm. As acknowledged by important members of the academic community, this result redefined the upper bounds on the time and space complexity of distribution-free learning. It was the first form of the boosting algorithm and a breakthrough for its time, but over the years many variations and implementations appeared and the base idea mutated. This evolution piqued my interest, and in this paper I will summarize the most popular boosting algorithms on decision trees, with their strengths and weaknesses. The topic matters because, much like in the telephone game we all played in childhood, an idea changes a little with every retelling.

Decision Trees 

The whole point of boosting is to improve the performance of Decision Trees. A Decision Tree mimics a real-life tree: internal nodes test attributes, branches correspond to the outcomes of those tests, and terminal nodes hold the final decisions. The idea behind it is that each example is described by a set of attributes. We pick a feature based on some criterion and split the examples into two subsets; since each subset may still mix examples of different classes, we keep dividing until every subset contains examples of a single class, and the terminal node makes the final decision, class 0 or 1. A prediction starts from the root node and, based on the characteristics of the example, always chooses one branch, navigating the tree until it reaches a terminal node that makes the decision.

Decision Trees have a number of well-known advantages. One of the biggest is that they require less effort in the pre-processing stage of data preparation: they need neither scaling nor normalization of the data. Another advantage is that the model itself is intuitive, and missing data does not derail the tree-building process. Decision Trees are used in machine learning, data mining, and statistics, so any performance boost is crucial. In the paper "Boosting Decision Trees" [3], Drucker and Cortes conducted a series of experiments to see how effective boosting can be. They used the NIST database of 120,000 digit images, sub-sampled to a 10x10 pixel array (100 continuous-valued features), with 10,000 training examples, 2,000 pruning examples, and 2,000 test examples, for a total of 14,000 examples. The results were impressive: a single tree reached a test error rate of about 12%, while boosted trees brought the error rate down to about 3.5%, with an ensemble of 25 trees required to reach that rate.
An observation they made was that "Overtraining never seems to be a problem for these weak learners, that is, as one increases the number of trees, the ensemble test error rate asymptotes and never increases", further consolidating the advantages of boosting algorithms. Using trees as weak learners gives a huge performance gain over single trees, far better than any traditional tree-construction technique. An important note is that the weak learner must have an error rate of less than 0.5 on the filtered training set.
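As a minimal sketch of the root-to-leaf navigation described above (the features, thresholds, and tree shape here are made-up toy values, not any particular library's API):

```python
# A tiny hand-built tree: internal nodes test a feature against a
# threshold, terminal nodes ("leaf") hold the class decision.
tree = {
    "feature": "age", "threshold": 30,
    "left":  {"feature": "income", "threshold": 50_000,
              "left": {"leaf": 0}, "right": {"leaf": 1}},
    "right": {"leaf": 1},
}

def predict(node, example):
    # Start at the root and descend until a terminal node is reached.
    while "leaf" not in node:
        branch = "left" if example[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["leaf"]

print(predict(tree, {"age": 25, "income": 60_000}))  # class 1
```

A real implementation would also choose the split features and thresholds from data (e.g., by information gain or Gini impurity); here the tree is fixed to keep the traversal itself visible.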

Boosting Algorithms

We've seen the importance of boosting and its impact on the way Decision Trees are constructed; however, we haven't yet looked at the boosting algorithms themselves and how they've improved over time. The most popular mutations of the classic boosting algorithm are:

AdaBoost

The algorithm appeared only 5 years after the initial publication of the original boosting algorithm and was invented by Freund and Schapire [4] to solve the practical difficulties of the original algorithm. In this context the weak learner is referred to as a "stump": a one-level Decision Tree. In simple terms, AdaBoost rests on three main ideas. First, the classification is made by combining a collection of stumps. Second, some stumps are more important than others and affect the final result more; the importance of each stump is decided at each step by how well it classified the sample. Third, initially all samples have equal weights, which add up to 1, so a stump's total error is always between 0 and 1: 0 for a perfect stump and 1 for a horrible one. The importance of a stump in the decision is determined with the formula from [4], which is based on the stump's weighted error. At each iteration the next stump is constructed by taking the previous stump's errors and mistakes into consideration: the weights of misclassified samples are increased and the weights of correctly classified samples are decreased, again using the formula from [4]. AdaBoost is definitely an improvement over the original algorithm, with a number of advantages: it has no parameters to tune other than the number of rounds; it requires no prior knowledge about the weak learners, so it can be combined with any method for finding them; and it is able to find examples that are either mislabeled in the training data or inherently ambiguous and hard to categorize (outliers) [4].
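The weight arithmetic of one AdaBoost round can be sketched from scratch as follows. The stump importance used here is the standard formula from [4], alpha = ½ · ln((1 − err)/err); the labels and stump predictions are made-up toy values (a real implementation would fit each stump to the weighted data):

```python
import math

y = [1, 1, -1, -1, 1]                    # made-up true labels in {-1, +1}
weights = [1 / len(y)] * len(y)          # all samples start with equal weight

def boost_round(stump_preds, y, weights):
    # Weighted error of this stump (assumes 0 < err < 0.5 for a weak learner).
    err = sum(w for w, p, t in zip(weights, stump_preds, y) if p != t)
    # Importance ("amount of say") of the stump: alpha = 0.5 * ln((1 - err) / err).
    alpha = 0.5 * math.log((1 - err) / err)
    # Reweight: misclassified samples (p*t = -1) go up, correct ones go down.
    new_w = [w * math.exp(-alpha * p * t)
             for w, p, t in zip(weights, stump_preds, y)]
    total = sum(new_w)
    return alpha, [w / total for w in new_w]  # renormalise so weights sum to 1

# One round with a stump that makes a single mistake (on sample index 3).
alpha, weights = boost_round([1, 1, -1, 1, 1], y, weights)
```

After the round, the misclassified sample carries the largest weight, so the next stump is pushed to focus on it.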

Gradient Boosting

It didn't take long before another approach to boosting appeared: in 1997 Leo Breiman laid the foundation of Gradient Boosting with his idea that boosting can be interpreted as an optimization algorithm on a suitable cost function [5]. There he introduced a function called the "edge", which differs from the margin only if there are more than two classes [5]. This had a great impact on how boosting algorithms on decision trees were viewed. Unlike AdaBoost, Gradient Boosting does not optimize the weak learner based on its misclassified observations but based on a loss function, and this optimization can be carried out with techniques like Gradient Descent. Another great difference is that the ensemble of trees is built gradually: the individual trees are summed sequentially, and each new tree is relied upon to recover the loss left by its predecessors. The algorithm came with a series of advantages, like the ability to support different loss functions (such as least squares). However, as presented in the paper "Avoiding Boosting Overfitting by Removing Confusing Samples", Gradient Boosting has one big disadvantage: overfitting. According to that paper, "Overfitting in boosting seems to occur only when target distributions overlap or the noise is present, thus boosting could show no overfitting on some datasets and overfit greatly on others" [6].
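The residual-fitting loop for the least-squares loss can be sketched as follows, using a one-split regression stump as a deliberately simple stand-in for a full tree (the data and the learning rate of 0.5 are made-up toy values):

```python
import numpy as np

def fit_stump(x, residual):
    # Pick the threshold that minimises squared error on the residual.
    best = None
    for t in x:
        left, right = residual[x <= t], residual[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, lv, rv = best
    return lambda q: np.where(q <= t, lv, rv)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # toy inputs
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2])   # toy targets

pred, lr, stumps = np.zeros_like(y), 0.5, []
for _ in range(50):
    stump = fit_stump(x, y - pred)        # fit the next learner to the residuals
    stumps.append(stump)
    pred = pred + lr * stump(x)           # the ensemble is a sequential sum

print(np.abs(y - pred).max())             # training error shrinks toward 0
```

For least squares the residual y − F(x) is exactly the negative gradient of the loss, which is why fitting each learner to the residuals is a form of gradient descent in function space.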

XGBoost

Since 1997 the boosting algorithm had not received a notable evolution until Tianqi Chen and Carlos Guestrin released the research article "XGBoost: A Scalable Tree Boosting System" in 2016 [7]. The whole idea of XGBoost was to optimize the Gradient Boosting algorithm. The main difference was that XGBoost came with some innovations: "a novel tree learning algorithm is for handling sparse data; a theoretically justified weighted quantile sketch procedure enables handling instance weights in approximate tree learning. Parallel and distributed computing makes learning faster which enables quicker model exploration" [7]. So basically XGBoost leveled up the game once more with:

· Parallelized tree building
· Cache awareness and out-of-core computing
· Tree pruning using a depth-first approach
· Regularization for avoiding overfitting
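To make the regularization point concrete, here is a sketch of the regularized leaf-weight and split-gain arithmetic from [7]: the optimal leaf weight is w* = −G/(H + λ) and a leaf's score is G²/(2(H + λ)), where G and H sum the per-instance first- and second-order gradients and λ is the L2 penalty. The gradients below are made-up toy values (for squared-error loss, g = prediction − target and h = 1):

```python
def leaf_weight(g, h, lam):
    # Optimal weight for a leaf under L2 regularisation: w* = -G / (H + lambda).
    return -sum(g) / (sum(h) + lam)

def leaf_score(g, h, lam):
    # A leaf's contribution to the (negative) objective: G^2 / (2 * (H + lambda)).
    return sum(g) ** 2 / (2 * (sum(h) + lam))

def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    # Gain of a split = score(left) + score(right) - score(parent) - gamma,
    # where gamma is the complexity cost of adding one more leaf.
    return (leaf_score(g_left, h_left, lam) + leaf_score(g_right, h_right, lam)
            - leaf_score(g_left + g_right, h_left + h_right, lam) - gamma)

g = [-1.0, -0.5, 0.8, 1.2]   # toy gradients: two groups with opposite signs
h = [1.0, 1.0, 1.0, 1.0]     # squared-error loss gives h = 1 per instance

print(leaf_weight(g, h, lam=1.0))   # pooled leaf weight is pulled toward zero
print(split_gain(g[:2], h[:2], g[2:], h[2:], lam=1.0, gamma=0.0))
```

Because the gradients cancel when pooled, the split that separates the two groups has a clearly positive gain; a large enough gamma would prune it, which is how the depth-first pruning and regularization interact.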

LightGBM

One year after the breakthrough of XGBoost, the Microsoft machine learning team, with Guolin Ke as main contributor, released the paper "LightGBM: A Highly Efficient Gradient Boosting Decision Tree" [8]. In this paper the new boosting method, called LightGBM, improves on its predecessor XGBoost by adding 2 new concepts:

· Gradient-based One-Side Sampling, or GOSS: the algorithm keeps "those instances with large gradients (e.g., larger than a pre-defined threshold, or among the top percentiles), and only randomly drop those instances with small gradients" [8]
· Exclusive Feature Bundling, or EFB, which would "bundle mutually exclusive features (i.e., they rarely take nonzero values simultaneously), to reduce the number of features" [8]
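The GOSS idea can be sketched as follows. The a and b fractions and the gradients are made-up toy values; the (1 − a)/b upweighting of the sampled small-gradient instances follows the compensation described in [8]:

```python
import numpy as np

def goss_sample(gradients, a=0.2, b=0.1, rng=None):
    # Keep the top-a fraction of instances by |gradient|, randomly sample a
    # b fraction of the rest, and upweight the sampled small-gradient
    # instances by (1 - a) / b so the data distribution is not shifted.
    rng = rng or np.random.default_rng(0)
    n = len(gradients)
    order = np.argsort(-np.abs(gradients))   # largest |gradient| first
    top = order[: int(a * n)]                # always kept
    rest = order[int(a * n):]
    sampled = rng.choice(rest, size=int(b * n), replace=False)
    weights = np.ones(n)
    weights[sampled] = (1 - a) / b           # compensate for the dropped mass
    kept = np.concatenate([top, sampled])
    return kept, weights[kept]

grads = np.linspace(-1, 1, 100)              # toy per-instance gradients
kept, w = goss_sample(grads)                 # 30 of 100 instances survive
```

The intuition is that instances with small gradients are already well trained, so most of them can be dropped, while the reweighting keeps the information-gain estimates on the reduced sample approximately unbiased.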

Concluding Remarks

By all counts, and with the proven results, it is obvious that the idea of a boosting algorithm has mutated. Each time the algorithm evolved it became better and faster, but it lost its initial form; so is it safe to assume that each implementation is a variation of the initial algorithm? Or should we take into consideration the fact that the root idea and objective have remained unchanged and call it a linear evolution? We've seen how Schapire's simple idea evolved over the years, hand in hand with the evolution of hardware. But one thing has remained unchanged: a boosting algorithm will improve the performance of a Decision Tree, and in the end, exactly like in a tree, each algorithm, research paper, or idea will be the root for a new generation.

Bibliography

[1] Schapire (1990), The strength of weak learnability, Machine Learning, 5(2): 197–227 

[2] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, November 1984

[3] Harris Drucker, Corinna Cortes, Boosting Decision Trees (1995), Advances in Neural Information Processing Systems 8:479–485

[4] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, August 1997. 

[5] Leo Breiman, “ARCING THE EDGE” (1997), Technical Report 486, Statistics Department University of California, Berkeley CA. 94720 

[6] Alexander Vezhnevets, Olga Barinova, "Avoiding Boosting Overfitting by Removing Confusing Samples" (2007), DOI: 10.1007/978-3-540-74958-5_40

[7] Tianqi Chen and Carlos Guestrin, "XGBoost: A Scalable Tree Boosting System" (2016), Association for Computing Machinery, ISBN 9781450342322

[8] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, Tie-Yan Liu, "LightGBM: A Highly Efficient Gradient Boosting Decision Tree", Advances in Neural Information Processing Systems 30 (NIPS 2017)
