==Algorithm==

===Preliminaries: decision tree learning===
{{main|Decision tree learning}}
Decision trees are a popular method for various machine learning tasks. Tree learning is almost "an off-the-shelf procedure for data mining", say [[Trevor Hastie|Hastie]] ''et al.'', "because it is invariant under scaling and various other transformations of feature values, is robust to inclusion of irrelevant features, and produces inspectable models. However, they are seldom accurate".<ref name="elemstatlearn">{{ElemStatLearn}}</ref>{{rp|352}}

In particular, trees that are grown very deep tend to learn highly irregular patterns: they [[overfitting|overfit]] their training sets, i.e. have [[Bias–variance tradeoff|low bias, but very high variance]]. Random forests are a way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance.<ref name="elemstatlearn"/>{{rp|587–588}} This comes at the expense of a small increase in the bias and some loss of interpretability, but generally greatly boosts the performance of the final model.

===Bagging===
{{main|Bootstrap aggregating}}
[[File:Random Forest Bagging Illustration.png|thumb|Illustration of training a random forest model. The training dataset (in this case, of 250 rows and 100 columns) is randomly sampled with replacement ''n'' times. Then, a decision tree is trained on each sample. Finally, for prediction, the results of all ''n'' trees are aggregated to produce a final decision.]]
The training algorithm for random forests applies the general technique of [[bootstrap aggregating]], or bagging, to tree learners. Given a training set {{mvar|X}} = {{mvar|x<sub>1</sub>}}, ..., {{mvar|x<sub>n</sub>}} with responses {{mvar|Y}} = {{mvar|y<sub>1</sub>}}, ..., {{mvar|y<sub>n</sub>}}, bagging repeatedly (''B'' times) selects a [[Sampling (statistics)#Replacement of selected units|random sample with replacement]] of the training set and fits trees to these samples:

{{block indent | em = 1.5 | text = For {{mvar|b}} = 1, ..., {{mvar|B}}:
# Sample, with replacement, {{mvar|n}} training examples from {{mvar|X}}, {{mvar|Y}}; call these {{mvar|X<sub>b</sub>}}, {{mvar|Y<sub>b</sub>}}.
# Train a classification or regression tree {{mvar|f<sub>b</sub>}} on {{mvar|X<sub>b</sub>}}, {{mvar|Y<sub>b</sub>}}.
}}

After training, predictions for unseen samples {{mvar|x'}} can be made by averaging the predictions from all the individual regression trees on {{mvar|x'}}:
<math display="block">\hat{f} = \frac{1}{B} \sum_{b=1}^B f_b(x')</math>
or by taking the plurality vote in the case of classification trees.

This bootstrapping procedure leads to better model performance because it decreases the [[Bias–variance dilemma|variance]] of the model without increasing the bias. This means that while the predictions of a single tree are highly sensitive to noise in its training set, the average of many trees is not, as long as the trees are not correlated. Simply training many trees on a single training set would give strongly correlated trees (or even the same tree many times, if the training algorithm is deterministic); bootstrap sampling is a way of de-correlating the trees by showing them different training sets.
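The bagging loop above can be sketched in a few lines of Python. The use of scikit-learn's <code>DecisionTreeRegressor</code> as the base learner and the helper names below are illustrative choices for this sketch, not part of the algorithm itself:

<syntaxhighlight lang="python">
import numpy as np
from sklearn.tree import DecisionTreeRegressor  # illustrative base tree learner

def bagged_trees(X, y, B=100, seed=0):
    """Fit B regression trees, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)  # sample n rows with replacement
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return trees

def bagged_predict(trees, x_new):
    """Average the individual tree predictions (a plurality vote would be used for classification)."""
    return np.mean([t.predict(x_new) for t in trees], axis=0)
</syntaxhighlight>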
Additionally, an estimate of the uncertainty of the prediction can be made as the standard deviation of the predictions from all the individual regression trees on {{mvar|x'}}:
<math display="block">\sigma = \sqrt{\frac{\sum_{b=1}^B (f_b(x') - \hat{f})^2}{B-1}}.</math>

The number {{mvar|B}} of samples (equivalently, of trees) is a free parameter. Typically, a few hundred to several thousand trees are used, depending on the size and nature of the training set. {{mvar|B}} can be optimized using [[Cross-validation (statistics)|cross-validation]], or by observing the ''[[out-of-bag error]]'': the mean prediction error on each training sample {{mvar|x<sub>i</sub>}}, using only the trees that did not have {{mvar|x<sub>i</sub>}} in their bootstrap sample.<ref name="islr">{{cite book |author1=Gareth James |author2=Daniela Witten |author3=Trevor Hastie |author4=Robert Tibshirani |title=An Introduction to Statistical Learning |publisher=Springer |year=2013 |url=http://www-bcf.usc.edu/~gareth/ISL/ |pages=316–321}}</ref> The training and test error tend to level off after some number of trees have been fit.

===From bagging to random forests===
{{main|Random subspace method}}
The above procedure describes the original bagging algorithm for trees. Random forests also include another type of bagging scheme: they use a modified tree learning algorithm that selects, at each candidate split in the learning process, a [[Random subspace method|random subset of the features]]. This process is sometimes called "feature bagging". The reason for doing this is the correlation of the trees in an ordinary bootstrap sample: if one or a few [[Feature (machine learning)|features]] are very strong predictors for the response variable (target output), these features will be selected in many of the {{mvar|B}} trees, causing them to become correlated. An analysis of how bagging and random subspace projection contribute to accuracy gains under different conditions is given by Ho.<ref name="ho2002">{{cite journal | first = Tin Kam | last = Ho | title = A Data Complexity Analysis of Comparative Advantages of Decision Forest Constructors | journal = Pattern Analysis and Applications | volume = 5 | issue = 2 | year = 2002 | pages = 102–112 | url = http://ect.bell-labs.com/who/tkh/publications/papers/compare.pdf | doi = 10.1007/s100440200009 | s2cid = 7415435 | access-date = 2015-11-13 | archive-date = 2016-04-17 | archive-url = https://web.archive.org/web/20160417091232/http://ect.bell-labs.com/who/tkh/publications/papers/compare.pdf | url-status = dead }}</ref>

Typically, for a classification problem with {{mvar|p}} features, {{sqrt|{{mvar|p}}}} (rounded down) features are used in each split.<ref name="elemstatlearn"/>{{rp|592}} For regression problems the inventors recommend {{math|''p''/3}} (rounded down) with a minimum node size of 5 as the default.<ref name="elemstatlearn"/>{{rp|592}} In practice, the best values for these parameters should be tuned on a case-by-case basis for each problem.<ref name="elemstatlearn"/>{{rp|592}}
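As an illustration of how feature bagging and the out-of-bag error appear in practice, the following sketch uses scikit-learn; the parameter names <code>max_features</code> and <code>oob_score</code> are specific to that library rather than part of the method itself, and the synthetic dataset is chosen only for the example:

<syntaxhighlight lang="python">
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 1000 samples, p = 64 features, only 8 of them informative.
X, y = make_classification(n_samples=1000, n_features=64, n_informative=8,
                           random_state=0)

# B = 500 trees; at each split, floor(sqrt(p)) randomly chosen features are considered.
forest = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                                oob_score=True, random_state=0)
forest.fit(X, y)

# Accuracy estimated from the out-of-bag samples, without a separate test set.
print(forest.oob_score_)
</syntaxhighlight>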
===ExtraTrees===
Adding one further step of randomization yields ''extremely randomized trees'', or ExtraTrees. As with ordinary random forests, they are an ensemble of individual trees, but there are two main differences: (1) each tree is trained using the whole learning sample (rather than a bootstrap sample), and (2) the top-down splitting is randomized: instead of computing the locally ''optimal'' cut-point for each feature under consideration (based on, e.g., [[information gain]] or the [[Gini impurity]]), a ''random'' cut-point is selected for each candidate feature. This value is chosen from a uniform distribution within the feature's empirical range (in the tree's training set). Then, of all the randomly generated splits, the split that yields the highest score is chosen to split the node. Similar to ordinary random forests, the number of randomly selected features to be considered at each node can be specified. Default values for this parameter are <math>\sqrt{p}</math> for classification and <math>p</math> for regression, where <math>p</math> is the number of features in the model.<ref>{{Cite journal | doi = 10.1007/s10994-006-6226-1| title = Extremely randomized trees| journal = Machine Learning| volume = 63| pages = 3–42| year = 2006| vauthors = Geurts P, Ernst D, Wehenkel L | url = http://orbi.ulg.ac.be/bitstream/2268/9357/1/geurts-mlj-advance.pdf| doi-access = free}}</ref>

===Random forests for high-dimensional data===
The basic random forest procedure may not work well in situations where there are a large number of features but only a small proportion of these features are informative with respect to sample classification. This can be addressed by encouraging the procedure to focus mainly on features and trees that are informative. Some methods for accomplishing this are:
* Prefiltering: Eliminate features that are mostly just noise.<ref>Dessi, N. & Milia, G. & Pes, B. (2013). Enhancing random forests performance in microarray data classification. Conference paper, 99–103. 10.1007/978-3-642-38326-7_15.</ref><ref>Ye, Y., Li, H., Deng, X., and Huang, J. (2008) Feature weighting random forest for detection of hidden web search interfaces. Journal of Computational Linguistics and Chinese Language Processing, 13, 387–404.</ref>
* Enriched random forest (ERF): Use weighted random sampling instead of simple random sampling at each node of each tree, giving greater weight to features that appear to be more informative.<ref>Amaratunga, D., Cabrera, J., Lee, Y.S. (2008) Enriched Random Forest. Bioinformatics, 24, 2010–2014.</ref><ref>Ghosh D, Cabrera J. (2022) Enriched random forest for high dimensional genomic data. IEEE/ACM Trans Comput Biol Bioinform. 19(5):2817–2828. doi:10.1109/TCBB.2021.3089417.</ref>
* Tree-weighted random forest (TWRF): Give more weight to more accurate trees, as illustrated in the sketch below.<ref>Winham, Stacey & Freimuth, Robert & Biernacka, Joanna. (2013). A weighted random forests approach to improve predictive performance. Statistical Analysis and Data Mining. 6. 10.1002/sam.11196.</ref><ref>Li, H. B., Wang, W., Ding, H. W., & Dong, J. (2010, 10–12 Nov. 2010). Trees weighting random forest method for classifying high-dimensional noisy data. Paper presented at the 2010 IEEE 7th International Conference on E-Business Engineering.</ref>
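The tree-weighting idea can be sketched as follows, with each tree's vote weighted by its out-of-bag accuracy. This is a generic illustration rather than the exact algorithm of the cited TWRF papers; the function names and the choice of weight are assumptions made for the example:

<syntaxhighlight lang="python">
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # illustrative base tree learner

def fit_weighted_forest(X, y, n_trees=100, seed=0):
    """Train trees on bootstrap samples; weight each tree by its out-of-bag accuracy."""
    rng = np.random.default_rng(seed)
    n = len(X)
    trees, weights = [], []
    for _ in range(n_trees):
        boot = rng.integers(0, n, size=n)       # bootstrap sample indices
        oob = np.setdiff1d(np.arange(n), boot)  # out-of-bag indices
        tree = DecisionTreeClassifier(max_features="sqrt").fit(X[boot], y[boot])
        # Out-of-bag accuracy serves as the weight (small fallback if no OOB points).
        weights.append(tree.score(X[oob], y[oob]) if len(oob) else 1e-6)
        trees.append(tree)
    return trees, np.asarray(weights)

def predict_weighted(trees, weights, X_new, classes):
    """Weighted plurality vote over the trees; `classes` is typically np.unique(y)."""
    votes = np.zeros((len(X_new), len(classes)))
    for tree, w in zip(trees, weights):
        pred = tree.predict(X_new)
        for k, c in enumerate(classes):
            votes[:, k] += w * (pred == c)
    return classes[np.argmax(votes, axis=1)]
</syntaxhighlight>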