Benefits and Issues of Random Forest

Benefits

The primary benefit of random forests is their prediction accuracy. On many datasets they perform at least as well as, and often better than, other classification and regression algorithms.
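As a rough illustration, the sketch below compares a single decision tree against a forest using scikit-learn and a built-in dataset; the dataset and parameter choices here are ours and purely illustrative, not a definitive benchmark.

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)

    # Compare a single tree against a forest of 100 trees with
    # 5-fold cross-validation; the forest typically scores higher.
    tree = DecisionTreeClassifier(random_state=0)
    forest = RandomForestClassifier(n_estimators=100, random_state=0)

    print("single tree:", cross_val_score(tree, X, y, cv=5).mean())
    print("forest:     ", cross_val_score(forest, X, y, cv=5).mean())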

A second benefit is that feature selection (identifying which predictor variables to include) is not required. As with any tree construction algorithm, candidate predictor variables that do not differentiate well between response variable values are never chosen to split a node. The only cost of including predictors that obviously don't contribute is the processing time spent during model construction to determine that they cannot be used to effectively split a node.
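To make this concrete, here is a minimal sketch, assuming scikit-learn and a synthetic dataset in which only three of ten candidate predictors carry signal; the uninformative predictors end up with near-zero importance scores because they are rarely chosen to split a node.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Ten candidate predictors, of which only three are informative;
    # the remaining seven are noise.
    X, y = make_classification(n_samples=1000, n_features=10,
                               n_informative=3, n_redundant=0,
                               random_state=0)

    forest = RandomForestClassifier(n_estimators=100,
                                    random_state=0).fit(X, y)

    # Noise predictors are rarely selected for splits, so their
    # importance scores come out close to zero.
    for i, importance in enumerate(forest.feature_importances_):
        print(f"feature {i}: {importance:.3f}")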

Issues

To achieve this prediction performance, random forests sacrifice transparency and simplicity. Individual trees that could easily be visualized and interpreted on their own become part of an ensemble, a collection of trees. The ensemble produces good predictions, but no simple, transparent explanation of how that performance is achieved. Why? Because there is not one tree but many, no single tree can represent the entire ensemble.
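One way to see this is to inspect the trees directly. In scikit-learn the fitted trees of a forest are exposed through the estimators_ attribute (the dataset below is purely illustrative): any one tree can still be visualized, but its prediction often disagrees with its neighbors, and only the aggregate answer belongs to the model. (Strictly, scikit-learn averages class probabilities rather than taking a hard majority vote, but the effect is similar.)

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_iris(return_X_y=True)
    forest = RandomForestClassifier(n_estimators=25,
                                    random_state=0).fit(X, y)

    # Each fitted tree is individually accessible and interpretable,
    # but the trees do not all predict the same class for a given row.
    votes = [int(t.predict(X[70:71])[0]) for t in forest.estimators_]
    print("individual tree votes:", votes)
    print("forest prediction:    ", forest.predict(X[70:71])[0])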

A second issue with random forests is construction time, which grows linearly with the number of trees in the forest. Prediction time grows the same way, because the forest's prediction is based on a vote over the individual trees' predictions. Prediction time is usually not a concern except when large datasets (10,000+ observations) are fed to the model for processing.
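A quick timing sketch (the dataset size, feature count, and tree counts are arbitrary choices for illustration) makes the roughly linear growth in construction time visible; n_jobs=1 keeps the fit serial so the scaling is not masked by parallelism.

    import time

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=5000, n_features=20,
                               random_state=0)

    # Doubling the number of trees should roughly double the fit time.
    for n in (50, 100, 200, 400):
        start = time.perf_counter()
        RandomForestClassifier(n_estimators=n, n_jobs=1,
                               random_state=0).fit(X, y)
        print(f"{n:4d} trees: {time.perf_counter() - start:.2f}s")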

The processing cost is somewhat mitigated by parallel processing. Because each tree is independent of every other tree, individual trees can be constructed in parallel on different processors.
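In scikit-learn, for example, this parallelism is exposed through the n_jobs parameter; a minimal sketch:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=5000, n_features=20,
                               random_state=0)

    # n_jobs=-1 builds the independent trees in parallel across all
    # available processor cores; prediction uses the same parallelism.
    forest = RandomForestClassifier(n_estimators=400, n_jobs=-1,
                                    random_state=0).fit(X, y)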