Boosting

By contrast, in boosting methods a base estimator is created and evaluated. Based on the error left over from that base estimate, a second estimator is built specifically to reduce the first estimator's error. The base estimator and the second estimator are then run together, and the resulting error is smaller than when the base model is run alone. A third estimator is then designed to reduce the error that remains from the combination of the first two, and so forth. The motivation is to combine several weak models to produce a powerful ensemble, and this refinement process can be extended over many boosting iterations. Hence, whereas the individual models of a random forest are independent, with their votes averaged to generate a single prediction, boosted models are additive - each subsequent model refines the prediction of the previous ones.
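
To make the contrast concrete, below is a minimal sketch on hypothetical toy data using scikit-learn's RandomForestRegressor (independent trees whose predictions are averaged) and GradientBoostingRegressor (shallow trees added sequentially, each correcting the current ensemble). The data, parameter values, and any scores printed are illustrative only, not a benchmark.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Toy regression data (purely illustrative).
X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Random forest: independent trees, predictions averaged.
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

# Gradient boosting: shallow trees added one after another,
# each fit to what the ensemble so far gets wrong.
gb = GradientBoostingRegressor(n_estimators=200, max_depth=3,
                               learning_rate=0.1, random_state=0).fit(X_train, y_train)

print("random forest MSE   :", mean_squared_error(y_test, rf.predict(X_test)))
print("gradient boosting MSE:", mean_squared_error(y_test, gb.predict(X_test)))
```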

Gradient boosting combines weak learners into a single strong learner in an iterative fashion. It is easiest to explain in the least-squares regression setting, where the goal is to train a model F that predicts values ŷ of the outcome variable,

ŷ = F(x)

The goal is to minimize the mean of the squared errors (yi - ŷi)2 between the predictions and the true values y, averaged over some training set. At each stage 1 ≤ m ≤ M of gradient boosting, it may be assumed that there is some imperfect model

Fm
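
For concreteness, the objective can be written as a small helper function. The sketch below assumes numpy arrays y (true values) and y_hat (the predictions F(x)); the name mse is illustrative.

```python
import numpy as np

def mse(y, y_hat):
    """Mean of the squared errors (yi - y_hat_i)^2 over the training set."""
    return np.mean((y - y_hat) ** 2)
```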

At the outset, a very weak model that just predicts the mean of y in the training set could be used. The gradient boosting algorithm does not change the existing model Fm in any way; instead, it improves on it by constructing a new model that adds an estimator h, providing the better model

Fm+1(x) = Fm(x) + h(x)
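
A minimal sketch of such a weak starting model on hypothetical toy data follows (the names F0, X, and y are illustrative): it simply predicts the training mean of y everywhere.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))                    # toy inputs
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)    # toy targets

F0 = np.full_like(y, y.mean())                           # predicts mean(y) for every x
print("MSE of the constant model:", np.mean((y - F0) ** 2))
```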

The question becomes: how do we find h? The gradient boosting solution starts from the observation that a perfect h would imply

Fm+1(x) = Fm(x) + h(x) = y

or, equivalently,

h(x) = y - Fm(x).

Therefore, gradient boosting will fit h to the residual y - Fm(x). Each Fm+1 learns to correct its predecessor Fm. The term "gradient" refers to the slope of the loss function: for squared error, the residual y - Fm(x) is, up to a constant factor, the negative gradient of the loss with respect to the model's current prediction, so fitting h to the residuals amounts to taking a downhill, gradient-descent step in the space of models. An h is found that accomplishes that objective, h(x) is added as a term to the larger model, and the residuals are calculated again. A new h is then fit to the new residuals, and so on until the residuals are very small or non-existent. The predicted value of the boosted model as a whole is the final prediction ŷ after all M models have been applied.
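
The whole procedure fits in a few lines. Below is a from-scratch sketch of least-squares gradient boosting on hypothetical toy data, using shallow regression trees for h; the names F, trees, predict, and the value of M are illustrative, and practical implementations add refinements (such as shrinking each h's contribution) that are omitted here.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))                    # toy inputs
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)    # toy targets

M = 50                                    # number of boosting stages
F = np.full_like(y, y.mean())             # F0: predict the mean of y
trees = []
for m in range(M):
    residual = y - F                      # what the current model still gets wrong
    h = DecisionTreeRegressor(max_depth=2).fit(X, residual)  # fit h to the residual
    F = F + h.predict(X)                  # F_{m+1}(x) = F_m(x) + h(x)
    trees.append(h)

def predict(X_new):
    """Boosted prediction: the initial mean plus all M corrections."""
    return y.mean() + sum(t.predict(X_new) for t in trees)

print("training MSE after boosting:", np.mean((y - F) ** 2))
```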

Freund and Schapire (1997) developed a statistical framework that validated the boosting methodology. This was later refined by Friedman et al. (2000, 2001), who recast boosting as a gradient-descent procedure and added improvements to the algorithm to reduce overfitting. In the proposed approach, the residuals are first transformed according to a user-selected loss function before being passed to the subsequent modeler, and a gradient-descent search is then employed to find the optimal weighting of that modeler's contribution to the overall prediction. In regression a commonly used loss function is squared error, while in classification the binomial deviance (log loss) is used.
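
In library implementations the loss function is typically exposed as a parameter. The brief sketch below uses scikit-learn's gradient-boosting estimators; note that the exact string accepted by the loss parameter (such as "squared_error") depends on the installed scikit-learn version, and the toy data are illustrative only.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

# Regression with squared-error loss: the residuals are plain y - F_m(x).
Xr, yr = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)
reg = GradientBoostingRegressor(loss="squared_error", n_estimators=100,
                                learning_rate=0.1).fit(Xr, yr)

# Classification: the default loss is the binomial deviance (log loss),
# so the "residuals" passed to each tree are gradients of that loss.
Xc, yc = make_classification(n_samples=500, n_features=10, random_state=0)
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1).fit(Xc, yc)
```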

Through their conceptual work and empirical studies, they demonstrated that the individual modelers need not be strong predictors: a sequence of weak predictors, when combined, performed just as well as strong predictors while reducing the tendency to overfit. For this reason, the modeler of choice in gradient boosting is the decision tree, whose depth is deliberately restricted to further weaken its individual predictive capability.
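
One way to see the weak-learner point is to restrict every tree to a single split (a depth-1 stump) and compare a lone stump against a long additive sequence of them. The sketch below does this with scikit-learn on hypothetical toy data; the data and any scores printed are illustrative only.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)

stump = DecisionTreeRegressor(max_depth=1)                          # one weak learner
boosted = GradientBoostingRegressor(n_estimators=500, max_depth=1)  # 500 stumps, added

print("single stump R^2  :", cross_val_score(stump, X, y, cv=5).mean())
print("boosted stumps R^2:", cross_val_score(boosted, X, y, cv=5).mean())
```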