Machine Learning Algorithms: Ensembles (part 7 of 12)
LogitBoost (Boosting):
In machine learning and computational learning theory, Logit Boost is a boosting algorithm formulated by Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The original paper casts the AdaBoost algorithm into a statistical framework. Specifically, if one considers AdaBoost as a generalized additive model and then applies the cost functional of logistic regression, one can derive the LogitBoost algorithm.
Minimizing the LogitBoost cost function
LogitBoost can be seen as a convex optimization. Specifically, given that we seek an additive model of the form

f(x) = \sum_{t} \alpha_t h_t(x),

the LogitBoost algorithm minimizes the logistic loss

\sum_{i=1}^{N} \log\left( 1 + e^{-y_i f(x_i)} \right),

where the labels are coded as y_i ∈ {−1, +1}.
Ref : http://svitsrv25.epfl.ch/R-doc/library/caTools/html/LogitBoost.html
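As a rough sketch of how this loss can be minimized, the snippet below implements the Newton-style LogitBoost update for binary labels coded as y ∈ {0, 1} (equivalent to the ±1 coding used in the loss above): each round recomputes the working response and weights from the current probability estimates and fits a weighted least-squares regression stump to them. The regression-stump weak learner and the helper names are illustrative choices, not prescribed by the original paper:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def logitboost_fit(X, y, n_rounds=50):
    # Sketch of LogitBoost for y in {0, 1}; the stump weak learner is an illustrative choice.
    F = np.zeros(len(y))                              # additive model, initialised at 0
    learners = []
    for _ in range(n_rounds):
        p = 1.0 / (1.0 + np.exp(-2.0 * F))            # current probability estimates
        w = np.clip(p * (1.0 - p), 1e-6, None)        # Newton weights
        z = (y - p) / w                               # working response
        stump = DecisionTreeRegressor(max_depth=1)
        stump.fit(X, z, sample_weight=w)              # weighted least-squares fit
        F += 0.5 * stump.predict(X)                   # update the additive model
        learners.append(stump)
    return learners

def logitboost_predict(learners, X):
    F = 0.5 * sum(stump.predict(X) for stump in learners)
    return (F > 0).astype(int)                        # predict class 1 if F(x) > 0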
Bootstrap Aggregation (Bagging):
Bootstrap Aggregation (or Bagging, for short) is a simple and very powerful ensemble method.
An ensemble method is a technique that combines the predictions from multiple machine learning algorithms together to make more accurate predictions than any individual model.
Bootstrap Aggregation is a general procedure that can be used to reduce the variance of algorithms that have high variance. A typical high-variance algorithm is the decision tree, for example classification and regression trees (CART).
Decision trees are sensitive to the specific data on which they are trained. If the training data is changed (e.g. a tree is trained on a subset of the training data) the resulting decision tree can be quite different and in turn the predictions can be quite different.
Bagging is the application of the Bootstrap procedure to a high-variance machine learning algorithm, typically decision trees.
Let’s assume we have a sample dataset of 1000 instances (x) and we are using the CART algorithm. Bagging of the CART algorithm would work as follows.
- Create many (e.g. 100) random sub-samples of our dataset with replacement.
- Train a CART model on each sample.
- Given a new dataset, calculate the average prediction from each model.
For example, if we had 5 bagged decision trees that made the following class predictions for an input sample: blue, blue, red, blue and red, we would take the most frequent class and predict blue.
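A minimal sketch of this procedure, assuming scikit-learn decision trees and integer class labels (the function names are illustrative):
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_cart_fit(X, y, n_models=100, random_state=0):
    # Train n_models CART trees, each on a bootstrap sample drawn with replacement.
    rng = np.random.RandomState(random_state)
    models = []
    for _ in range(n_models):
        idx = rng.randint(0, len(X), size=len(X))     # bootstrap sample with replacement
        tree = DecisionTreeClassifier()               # deep, unpruned tree
        tree.fit(X[idx], y[idx])
        models.append(tree)
    return models

def bagged_cart_predict(models, X):
    # Majority vote across the bagged trees: the most frequent class wins.
    votes = np.stack([m.predict(X) for m in models]).astype(int)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)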
When bagging with decision trees, we are less concerned about individual trees overfitting the training data. For this reason and for efficiency, the individual decision trees are grown deep (e.g. few training samples at each leaf node of the tree) and the trees are not pruned. These trees will have both high variance and low bias. These are important characteristics of the sub-models when combining predictions using bagging.
The only parameter when bagging decision trees is the number of bootstrap samples and hence the number of trees to include. This can be chosen by increasing the number of trees run after run until accuracy stops improving (e.g. on a cross-validation test harness), as in the sketch below. Very large numbers of models may take a long time to prepare, but will not overfit the training data.
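For example, with scikit-learn's BaggingClassifier one could grow the ensemble step by step on a cross-validation harness (the dataset here is a synthetic placeholder):
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)   # placeholder dataset

# Increase the number of bagged trees until cross-validated accuracy stops improving.
for n_trees in [10, 50, 100, 200, 500]:
    bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=n_trees)
    print(n_trees, cross_val_score(bagging, X, y, cv=10).mean())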
Just like the decision trees themselves, Bagging can be used for classification and regression problems.
AdaBoost:
Boosting is an approach to machine learning based on the idea of creating a highly accurate prediction rule by combining many relatively weak and inaccurate rules.
The AdaBoost algorithm of Freund and Schapire was the first practical boosting algorithm, and remains one of the most widely used and studied, with applications in numerous fields. Over the years, a great variety of attempts have been made to “explain” AdaBoost as a learning algorithm, that is, to understand why it works, how it works, and when it works (or fails).
These perspectives and analyses can be reviewed and compared in terms of the strengths and weaknesses of the various approaches.
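To make the idea of combining weak rules concrete, here is a minimal sketch of the AdaBoost weight-update loop, assuming labels y ∈ {-1, +1} and decision stumps as the weak rules (the helper names are illustrative):
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    # Sketch of AdaBoost for y in {-1, +1}; decision stumps are the weak rules.
    n = len(y)
    weights = np.full(n, 1.0 / n)                     # start with uniform example weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=weights)
        pred = stump.predict(X)
        err = np.sum(weights[pred != y])              # weighted training error
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-10))  # weight of this weak rule
        weights *= np.exp(-alpha * y * pred)          # up-weight misclassified examples
        weights /= weights.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # Final rule: weighted majority vote of the weak rules.
    return np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))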
Stacked Generalization (blending):
Stacking, Blending and Stacked Generalization are all the same thing with different names. It is a kind of ensemble learning.
In traditional ensemble learning, we have multiple classifiers trying to fit to a training set to approximate the target function. Since each classifier will have its own output, we will need to find a combining mechanism to combine the results. This can be through voting (majority wins), weighted voting (some classifier has more authority than the others), averaging the results, etc. This is the traditional way of ensemble learning.
In stacking, the combining mechanism is that the output of the classifiers (Level 0 classifiers) will be used as training data for another classifier (Level 1 classifier) to approximate the same target function. Basically, you let the Level 1 classifier figure out the combining mechanism, as sketched below.
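A minimal sketch of this idea with scikit-learn, using out-of-fold predictions from the Level 0 classifiers as the training features for the Level 1 classifier (the models and the synthetic dataset are placeholder choices):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)    # placeholder dataset

# Level 0 classifiers: their out-of-fold predicted probabilities become new features.
level0 = [RandomForestClassifier(random_state=0), SVC(probability=True, random_state=0)]
meta_features = np.column_stack([
    cross_val_predict(clf, X, y, cv=5, method='predict_proba')[:, 1]
    for clf in level0
])

# Level 1 classifier learns the combining mechanism from the Level 0 outputs.
# (At prediction time, each Level 0 model, refit on all of X, feeds the Level 1 model.)
level1 = LogisticRegression()
level1.fit(meta_features, y)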
Gradient Boosting Machines (GBM)
Gradient boosting machines are a family of powerful machine-learning techniques that have shown considerable success in a wide range of practical applications. They are highly customizable to the particular needs of the application; for example, they can be learned with respect to different loss functions.
The response variable y can come from different distributions. This naturally leads to specification of different loss functions Ψ. In particular, if the response variable is binary, i.e., y ∈ {0, 1}, one can consider the binomial loss function. If the response variable is continuous, i.e., y ∈ R, one can use classical L2 squared loss function or the robust regression Huber loss. For other response distribution families like the Poisson-counts, specific loss functions have to be designed.
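For instance, in recent versions of scikit-learn these choices map onto different estimators and loss arguments (a rough illustration; exact argument names vary between versions):
from sklearn.ensemble import (GradientBoostingClassifier, GradientBoostingRegressor,
                              HistGradientBoostingRegressor)

# Binary response y in {0, 1}: the classifier uses the binomial (log) loss by default.
gbm_binomial = GradientBoostingClassifier()

# Continuous response: the classical L2 squared loss, or the robust Huber loss.
gbm_l2 = GradientBoostingRegressor(loss='squared_error')
gbm_huber = GradientBoostingRegressor(loss='huber', alpha=0.9)

# Count (Poisson) response: a Poisson deviance loss via the histogram-based estimator.
gbm_poisson = HistGradientBoostingRegressor(loss='poisson')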
To make the problem of function estimation tractable, we can restrict the function search space to a parametric family of functions f(x, θ). This changes the function optimization problem into a parameter estimation one:
\hat{f}(x) = f(x, \hat{\theta}), \qquad \hat{\theta} = \arg\min_{\theta}\; E_x \left[ E_y \left( \Psi\left[ y, f(x, \theta) \right] \right) \mid x \right]
One can arbitrarily specify both the loss function and the base-learner models on demand. In practice, given some specific loss function Ψ(y, f) and/or a custom base-learner h(x, θ), the solution to the parameter estimates can be difficult to obtain. To deal with this, it was proposed to choose a new function h(x, θ_t) to be the most parallel to the negative gradient \{g_t(x_i)\}_{i=1}^{N} along the observed data:
g_t(x) = E_y \left[ \frac{\partial \Psi(y, f(x))}{\partial f(x)} \;\middle|\; x \right]_{f(x) = \hat{f}_{t-1}(x)}
Instead of looking for the general solution for the boost increment in the function space, one can simply choose the new function increment to be the most correlated with −gt(x). This permits the replacement of a potentially very hard optimization task with the classic least-squares minimization one:
(\rho_t, \theta_t) = \arg\min_{\rho,\, \theta} \sum_{i=1}^{N} \left[ -g_t(x_i) + \rho\, h(x_i, \theta) \right]^2
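A minimal sketch of this procedure for the L2 loss, where the negative gradient −g_t(x_i) is simply the residual y_i − f(x_i), and where the step size ρ_t is replaced by a fixed learning rate rather than a line search (the names and the tree base-learner are illustrative):
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbm_fit(X, y, n_rounds=100, learning_rate=0.1):
    # Gradient boosting sketch for the L2 loss Psi(y, f) = (y - f)^2 / 2.
    f = np.full(len(y), y.mean())         # initial constant model f_0
    trees = []
    for _ in range(n_rounds):
        residual = y - f                  # negative gradient -g_t for the L2 loss
        h = DecisionTreeRegressor(max_depth=3)
        h.fit(X, residual)                # least-squares fit of h(x, theta_t) to -g_t
        f += learning_rate * h.predict(X) # take a small step along the new base-learner
        trees.append(h)
    return y.mean(), trees

def gbm_predict(f0, trees, X, learning_rate=0.1):
    return f0 + learning_rate * sum(t.predict(X) for t in trees)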
Gradient Boosted Regression Trees:
Gradient Boosted Regression Trees (GBRT) is a flexible non-parametric statistical learning technique for classification and regression.
GBRT provides three knobs to control overfitting: tree structure, shrinkage, and randomization.
Tree Structure
The depth of the individual trees is one aspect of model complexity. The depth of the trees basically controls the degree of feature interactions that your model can fit. For example, if you want to capture the interaction between a feature latitude and a feature longitude, your trees need a depth of at least two. Unfortunately, the degree of feature interactions is not known in advance, but it is usually fine to assume that it is fairly low; in practice, a depth of 4-6 usually gives the best results. In scikit-learn you can constrain the depth of the trees using the max_depth argument.
Another way to control the depth of the trees is by enforcing a lower bound on the number of samples in a leaf: this avoids unbalanced splits where a leaf is formed for just one extreme data point. In scikit-learn you can do this using the min_samples_leaf argument. This is effectively a means to introduce bias into your model, with the hope of also reducing variance, as shown in the example below (the dataset and the deviance_plot helper are stand-ins for the ones used in the original tutorial):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Placeholder data and settings (the original example uses its own train/test split).
X, y = make_regression(n_samples=2000, n_features=10, noise=10.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
n_estimators = 1000

def deviance_plot(est, X_test, y_test, ax, label, train_color, test_color):
    # Minimal stand-in for the tutorial's deviance_plot helper: plot the
    # train and test loss at each boosting stage.
    test_dev = np.array([mean_squared_error(y_test, pred)
                         for pred in est.staged_predict(X_test)])
    stages = np.arange(est.n_estimators) + 1
    ax.plot(stages, test_dev, color=test_color, label='Test ' + label)
    ax.plot(stages, est.train_score_, color=train_color, label='Train ' + label)
    return test_dev, ax

def fmt_params(params):
    return ", ".join("{0}={1}".format(key, val) for key, val in params.items())

fig = plt.figure(figsize=(8, 5))
ax = plt.gca()
for params, (test_color, train_color) in [({}, ('#d7191c', '#2c7bb6')),
                                          ({'min_samples_leaf': 3}, ('#fdae61', '#abd9e9'))]:
    est = GradientBoostingRegressor(n_estimators=n_estimators, max_depth=1, learning_rate=1.0)
    est.set_params(**params)
    est.fit(X_train, y_train)
    test_dev, ax = deviance_plot(est, X_test, y_test, ax=ax, label=fmt_params(params),
                                 train_color=train_color, test_color=test_color)
ax.annotate('Higher bias', xy=(900, est.train_score_[899]), xycoords='data',
            xytext=(600, test_dev.max()), textcoords='data',
            arrowprops=dict(arrowstyle="->", connectionstyle="arc"))
ax.annotate('Lower variance', xy=(900, test_dev[899]), xycoords='data',
            xytext=(600, test_dev.max() * 0.8), textcoords='data',
            arrowprops=dict(arrowstyle="->", connectionstyle="arc"))
plt.legend(loc='upper right')
Random Forests
The introduction of random forests proper was first made in a paper by Leo Breiman. This paper describes a method of building a forest of uncorrelated trees using a CART-like procedure, combined with randomized node optimization and bagging. In addition, it combines several ingredients, some previously known and some novel, which form the basis of the modern practice of random forests, in particular:
1. Using out-of-bag error as an estimate of the generalization error.
2. Measuring variable importance through permutation.
Features of random forests include:
- It is unexcelled in accuracy among current algorithms.
- It runs efficiently on large data bases.
- It can handle thousands of input variables without variable deletion.
- It gives estimates of what variables are important in the classification.
- It generates an internal unbiased estimate of the generalization error as the forest building progresses.
- It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing.
- It has methods for balancing error in class population unbalanced data sets.
- Generated forests can be saved for future use on other data.
- Prototypes are computed that give information about the relation between the variables and the classification.
- It computes proximities between pairs of cases that can be used in clustering, locating outliers, or (by scaling) give interesting views of the data.
- The capabilities of the above can be extended to unlabeled data, leading to unsupervised clustering, data views and outlier detection.
- It offers an experimental method for detecting variable interactions.
Random forests differ in only one way from the general tree-bagging scheme described above: they use a modified tree learning algorithm that selects, at each candidate split in the learning process, a random subset of the features. This process is sometimes called "feature bagging". The reason for doing this is the correlation of the trees in an ordinary bootstrap sample: if one or a few features are very strong predictors for the response variable (target output), these features will be selected in many of the trees, causing them to become correlated. An analysis of how bagging and random subspace projection contribute to accuracy gains under different conditions is given by Ho.
Typically, for a classification problem with p features, √p (rounded down) features are used in each split. For regression problems, the inventors recommend p/3 (rounded down) with a minimum node size of 5 as the default, as illustrated in the sketch below.
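For illustration, these defaults correspond roughly to the max_features settings in scikit-learn's random forest estimators (the dataset below is a synthetic placeholder):
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

X, y = make_classification(n_samples=1000, n_features=16, random_state=0)  # placeholder data

# Classification: sqrt(p) features considered at each candidate split (feature bagging).
clf = RandomForestClassifier(n_estimators=200, max_features='sqrt', oob_score=True,
                             random_state=0)
clf.fit(X, y)
print("Out-of-bag estimate of generalization accuracy:", clf.oob_score_)
print("Impurity-based variable importances:", clf.feature_importances_)

# Regression: roughly p/3 features per split and a minimum leaf size of 5.
reg = RandomForestRegressor(n_estimators=200, max_features=1/3, min_samples_leaf=5)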
For more detail, refer to:
http://www.statsoft.com/Textbook/Random-Forest