Random forest

Template:About Template:Short description Template:Machine learning

Random forests or random decision forests is an ensemble learning method for classification, regression and other tasks that works by creating a multitude of decision trees during training. For classification tasks, the output of the random forest is the class selected by most trees. For regression tasks, the output is the average of the predictions of the trees.<ref name="ho1995"/><ref name="ho1998"/> Random forests correct for decision trees' habit of overfitting to their training set.Template:R Template:Rp

The first algorithm for random decision forests was created in 1995 by Tin Kam Ho<ref name="ho1995">Template:Cite conference</ref> using the random subspace method,<ref name="ho1998">Template:Cite journal</ref> which, in Ho's formulation, is a way to implement the "stochastic discrimination" approach to classification proposed by Eugene Kleinberg.<ref name="kleinberg1990">Template:Cite journal</ref><ref name="kleinberg1996">Template:Cite journal</ref><ref name="kleinberg2000">Template:Cite journal</ref>

An extension of the algorithm was developed by Leo Breiman<ref name="breiman2001">Template:Cite journal</ref> and Adele Cutler,<ref name="rpackage"/> who registered<ref>U.S. trademark registration number 3185828, registered 2006/12/19.</ref> "Random Forests" as a trademark in 2006 (Template:As of, owned by Minitab, Inc.).<ref>{{#invoke:citation/CS1|citation |CitationClass=web }}</ref> The extension combines Breiman's "bagging" idea and random selection of features, introduced first by Ho<ref name="ho1995"/> and later independently by Amit and Geman<ref name="amitgeman1997">Template:Cite journal</ref> in order to construct a collection of decision trees with controlled variance.

HistoryEdit

The general method of random decision forests was first proposed by Salzberg and Heath in 1993,<ref>Heath, D., Kasif, S. and Salzberg, S. (1993). k-DT: A multi-tree learning method. In Proceedings of the Second Intl. Workshop on Multistrategy Learning, pp. 138-149.</ref> with a method that used a randomized decision tree algorithm to create multiple trees and then combine them using majority voting. This idea was developed further by Ho in 1995.<ref name="ho1995"/> Ho established that forests of trees splitting with oblique hyperplanes can gain accuracy as they grow without suffering from overtraining, as long as the forests are randomly restricted to be sensitive to only selected feature dimensions. A subsequent work along the same lines<ref name="ho1998"/> concluded that other splitting methods behave similarly, as long as they are randomly forced to be insensitive to some feature dimensions. This observation that a more complex classifier (a larger forest) gets more accurate nearly monotonically is in sharp contrast to the common belief that the complexity of a classifier can only grow to a certain level of accuracy before being hurt by overfitting. The explanation of the forest method's resistance to overtraining can be found in Kleinberg's theory of stochastic discrimination.<ref name="kleinberg1990"/><ref name="kleinberg1996"/><ref name="kleinberg2000"/>

The early development of Breiman's notion of random forests was influenced by the work of Amit and Geman<ref name="amitgeman1997"/> who introduced the idea of searching over a random subset of the available decisions when splitting a node, in the context of growing a single tree. The idea of random subspace selection from Ho<ref name="ho1998"/> was also influential in the design of random forests. This method grows a forest of trees, and introduces variation among the trees by projecting the training data into a randomly chosen subspace before fitting each tree or each node. Finally, the idea of randomized node optimization, where the decision at each node is selected by a randomized procedure, rather than a deterministic optimization was first introduced by Thomas G. Dietterich.<ref>Template:Cite journal</ref>

The proper introduction of random forests was made in a paper by Leo Breiman.<ref name="breiman2001"/> This paper describes a method of building a forest of uncorrelated trees using a CART like procedure, combined with randomized node optimization and bagging. In addition, this paper combines several ingredients, some previously known and some novel, which form the basis of the modern practice of random forests, in particular:

Using out-of-bag error as an estimate of the generalization error.
Measuring variable importance through permutation.

The report also offers the first theoretical result for random forests in the form of a bound on the generalization error which depends on the strength of the trees in the forest and their correlation.

AlgorithmEdit

Preliminaries: decision tree learningEdit

{{#invoke:Labelled list hatnote|labelledList|Main article|Main articles|Main page|Main pages}} Decision trees are a popular method for various machine learning tasks. Tree learning is almost "an off-the-shelf procedure for data mining", say Hastie et al., "because it is invariant under scaling and various other transformations of feature values, is robust to inclusion of irrelevant features, and produces inspectable models. However, they are seldom accurate".<ref name="elemstatlearn">Template:ElemStatLearn</ref>Template:Rp

In particular, trees that are grown very deep tend to learn highly irregular patterns: they overfit their training sets, i.e. have low bias, but very high variance. Random forests are a way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance.<ref name="elemstatlearn"/>Template:Rp This comes at the expense of a small increase in the bias and some loss of interpretability, but generally greatly boosts the performance in the final model.

BaggingEdit

{{#invoke:Labelled list hatnote|labelledList|Main article|Main articles|Main page|Main pages}}

File:Random Forest Bagging Illustration.png

Illustration of training a Random Forest model. The training dataset (in this case, of 250 rows and 100 columns) is randomly sampled with replacement n times. Then, a decision tree is trained on each sample. Finally, for prediction, the results of all n trees are aggregated to produce a final decision.

The training algorithm for random forests applies the general technique of bootstrap aggregating, or bagging, to tree learners. Given a training set Template:Mvar = Template:Mvar, ..., Template:Mvar with responses Template:Mvar = Template:Mvar, ..., Template:Mvar, bagging repeatedly (B times) selects a random sample with replacement of the training set and fits trees to these samples: Template:Block indent

After training, predictions for unseen samples Template:Mvar can be made by averaging the predictions from all the individual regression trees on Template:Mvar:

or by taking the plurality vote in the case of classification trees.

This bootstrapping procedure leads to better model performance because it decreases the variance of the model, without increasing the bias. This means that while the predictions of a single tree are highly sensitive to noise in its training set, the average of many trees is not, as long as the trees are not correlated. Simply training many trees on a single training set would give strongly correlated trees (or even the same tree many times, if the training algorithm is deterministic); bootstrap sampling is a way of de-correlating the trees by showing them different training sets.

Additionally, an estimate of the uncertainty of the prediction can be made as the standard deviation of the predictions from all the individual regression trees on Template:Mvar: <math display="block">\sigma = \sqrt{\frac{\sum_{b=1}^B (f_b(x') - \hat{f})^2}{B-1} }.</math>

The number Template:Mvar of samples (equivalently, of trees) is a free parameter. Typically, a few hundred to several thousand trees are used, depending on the size and nature of the training set. Template:Mvar can be optimized using cross-validation, or by observing the out-of-bag error: the mean prediction error on each training sample Template:Mvar, using only the trees that did not have Template:Mvar in their bootstrap sample.<ref name="islr">Template:Cite book</ref>

The training and test error tend to level off after some number of trees have been fit.

From bagging to random forestsEdit

{{#invoke:Labelled list hatnote|labelledList|Main article|Main articles|Main page|Main pages}} The above procedure describes the original bagging algorithm for trees. Random forests also include another type of bagging scheme: they use a modified tree learning algorithm that selects, at each candidate split in the learning process, a random subset of the features. This process is sometimes called "feature bagging". The reason for doing this is the correlation of the trees in an ordinary bootstrap sample: if one or a few features are very strong predictors for the response variable (target output), these features will be selected in many of the Template:Mvar trees, causing them to become correlated. An analysis of how bagging and random subspace projection contribute to accuracy gains under different conditions is given by Ho.<ref name="ho2002">Template:Cite journal</ref>

Typically, for a classification problem with Template:Mvar features, Template:Sqrt (rounded down) features are used in each split.<ref name="elemstatlearn"/>Template:Rp For regression problems the inventors recommend Template:Math (rounded down) with a minimum node size of 5 as the default.<ref name="elemstatlearn"/>Template:Rp In practice, the best values for these parameters should be tuned on a case-to-case basis for every problem.<ref name="elemstatlearn"/>Template:Rp

ExtraTreesEdit

Adding one further step of randomization yields extremely randomized trees, or ExtraTrees. As with ordinary random forests, they are an ensemble of individual trees, but there are two main differences: (1) each tree is trained using the whole learning sample (rather than a bootstrap sample), and (2) the top-down splitting is randomized: for each feature under consideration, a number of random cut-points are selected, instead of computing the locally optimal cut-point (based on, e.g., information gain or the Gini impurity). The values are chosen from a uniform distribution within the feature's empirical range (in the tree's training set). Then, of all the randomly chosen splits, the split that yields the highest score is chosen to split the node.

Similar to ordinary random forests, the number of randomly selected features to be considered at each node can be specified. Default values for this parameter are <math>\sqrt{p}</math> for classification and <math>p</math> for regression, where <math>p</math> is the number of features in the model.<ref>Template:Cite journal</ref>

Random forests for high-dimensional dataEdit

The basic random forest procedure may not work well in situations where there are a large number of features but only a small proportion of these features are informative with respect to sample classification. This can be addressed by encouraging the procedure to focus mainly on features and trees that are informative. Some methods for accomplishing this are:

Prefiltering: Eliminate features that are mostly just noise.<ref>Dessi, N. & Milia, G. & Pes, B. (2013). Enhancing random forests performance in microarray data classification. Conference paper, 99-103. 10.1007/978-3-642-38326-7_15.</ref><ref>Ye, Y., Li, H., Deng, X., and Huang, J. (2008) Feature weighting random forest for detection of hidden web search interfaces. Journal of Computational Linguistics and Chinese Language Processing, 13, 387–404.</ref>
Enriched random forest (ERF): Use weighted random sampling instead of simple random sampling at each node of each tree, giving greater weight to features that appear to be more informative.<ref>Amaratunga, D., Cabrera, J., Lee, Y.S. (2008) Enriched Random Forest. Bioinformatics, 24, 2010-2014.</ref><ref>Ghosh D, Cabrera J. (2022) Enriched random forest for high dimensional genomic data. IEEE/ACM Trans Comput Biol Bioinform. 19(5):2817-2828. doi:10.1109/TCBB.2021.3089417. </ref>
Tree-weighted random forest (TWRF): Give more weight to more accurate trees.<ref>Winham, Stacey & Freimuth, Robert & Biernacka, Joanna. (2013). A weighted random forests approach to improve predictive performance. Statistical Analysis and Data Mining. 6. 10.1002/sam.11196. </ref><ref> Li, H. B., Wang, W., Ding, H. W., & Dong, J. (2010, 10-12 Nov. 2010). Trees weighting random forest method for classifying high-dimensional noisy data. Paper presented at the 2010 IEEE 7th International Conference on E-Business Engineering. </ref>

PropertiesEdit

Variable importanceEdit

Random forests can be used to rank the importance of variables in a regression or classification problem in a natural way. The following technique was described in Breiman's original paper<ref name=breiman2001/> and is implemented in the R package randomForest.<ref name="rpackage">{{#invoke:citation/CS1|citation |CitationClass=web }} </ref>

Permutation importanceEdit

To measure a feature's importance in a data set <math>\mathcal{D}_n =\{(X_i, Y_i)\}_{i=1}^n</math>, first a random forest is trained on the data. During training, the out-of-bag error for each data point is recorded and averaged over the forest. (If bagging is not used during training, we can instead compute errors on an independent test set.)

After training, the values of the feature are permuted in the out-of-bag samples and the out-of-bag error is again computed on this perturbed data set. The importance for the feature is computed by averaging the difference in out-of-bag error before and after the permutation over all trees. The score is normalized by the standard deviation of these differences.

Features which produce large values for this score are ranked as more important than features which produce small values. The statistical definition of the variable importance measure was given and analyzed by Zhu et al.<ref>Template:Cite journal</ref>

This method of determining variable importance has some drawbacks:

When features have different numbers of values, random forests favor features with more values. Solutions to this problem include partial permutations<ref>Template:Cite conference</ref><ref>Template:Cite journal</ref><ref name=":02">Template:Cite journal</ref> and growing unbiased trees.<ref>Template:Cite journal</ref><ref>Template:Cite journal</ref>
If the data contain groups of correlated features of similar relevance, then smaller groups are favored over large groups.<ref>Template:Cite journal</ref>
If there are collinear features, the procedure may fail to identify important features. A solution is to permute groups of correlated features together.<ref name=":2">{{#invoke:citation/CS1|citation

|CitationClass=web }}</ref>

Mean decrease in impurity feature importanceEdit

This approach to feature importance for random forests considers as important the variables which decrease a lot the impurity during splitting.<ref>Template:Cite book</ref> It is described in the book Classification and Regression Trees by Leo Breiman<ref>Template:Cite book</ref> and is the default implementation in sci-kit learn and R. The definition is:<math display="block">\text{unormalized average importance}(x)=\frac{1}{n_T} \sum_{i=1}^{n_T} \sum_{\text{node }j \in T_i | \text{split variable}(j) = x} p_{T_i}(j)\Delta i_{T_i}(j),</math>where

<math>x</math> is a feature
<math>n_T</math> is the number of trees in the forest
<math>T_i</math> is tree <math>i</math>
<math>p_{T_i}(j)=\frac{n_j}{n}</math> is the fraction of samples reaching node <math>j</math>
<math>\Delta i_{T_i}(j)</math> is the change in impurity in tree <math>t</math> at node <math>j</math>.

As impurity measure for samples falling in a node e.g. the following statistics can be used:

The normalized importance is then obtained by normalizing over all features, so that the sum of normalized feature importances is 1.

The sci-kit learn default implementation can report misleading feature importance:<ref name=":2" />

it favors high cardinality features
it uses training statistics and so does not reflect a feature's usefulness for predictions on a test set<ref>https://scikit-learn.org/stable/auto_examples/inspection/plot_permutation_importance.html 31. Aug. 2023</ref>

Relationship to nearest neighborsEdit

A relationship between random forests and the [[K-nearest neighbor algorithm|Template:Mvar-nearest neighbor algorithm]] (Template:Mvar-NN) was pointed out by Lin and Jeon in 2002.<ref name="linjeon02">Template:Cite tech report</ref> Both can be viewed as so-called weighted neighborhoods schemes. These are models built from a training set <math>\{(x_i, y_i)\}_{i=1}^n</math> that make predictions <math>\hat{y}</math> for new points Template:Mvar by looking at the "neighborhood" of the point, formalized by a weight function Template:Mvar:<math display="block">\hat{y} = \sum_{i=1}^n W(x_i, x') \, y_i.</math>Here, <math>W(x_i, x')</math> is the non-negative weight of the Template:Mvar'th training point relative to the new point Template:Mvar in the same tree. For any Template:Mvar, the weights for points <math>x_i</math> must sum to 1. Weight functions are as follows:

In Template:Mvar-NN, <math>W(x_i, x') = \frac{1}{k}</math> if Template:Mvar is one of the Template:Mvar points closest to Template:Mvar, and zero otherwise.
In a tree, <math>W(x_i, x') = \frac{1}{k'}</math> if Template:Mvar is one of the Template:Mvar points in the same leaf as Template:Mvar, and zero otherwise.

Since a forest averages the predictions of a set of Template:Mvar trees with individual weight functions <math>W_j</math>, its predictions are<math display="block">\hat{y} = \frac{1}{m}\sum_{j=1}^m\sum_{i=1}^n W_{j}(x_i, x') \, y_i = \sum_{i=1}^n\left(\frac{1}{m}\sum_{j=1}^m W_{j}(x_i, x')\right) \, y_i.</math>

This shows that the whole forest is again a weighted neighborhood scheme, with weights that average those of the individual trees. The neighbors of Template:Mvar in this interpretation are the points <math>x_i</math> sharing the same leaf in any tree <math>j</math>. In this way, the neighborhood of Template:Mvar depends in a complex way on the structure of the trees, and thus on the structure of the training set. Lin and Jeon show that the shape of the neighborhood used by a random forest adapts to the local importance of each feature.<ref name="linjeon02"/>

Unsupervised learningEdit

As part of their construction, random forest predictors naturally lead to a dissimilarity measure among observations. One can analogously define dissimilarity between unlabeled data, by training a forest to distinguish original "observed" data from suitably generated synthetic data drawn from a reference distribution.<ref name=breiman2001/><ref>Template:Cite journal</ref> A random forest dissimilarity is attractive because it handles mixed variable types very well, is invariant to monotonic transformations of the input variables, and is robust to outlying observations. Random forest dissimilarity easily deals with a large number of semi-continuous variables due to its intrinsic variable selection; for example, the "Addcl 1" random forest dissimilarity weighs the contribution of each variable according to how dependent it is on other variables. Random forest dissimilarity has been used in a variety of applications, e.g. to find clusters of patients based on tissue marker data.<ref>Template:Cite journal</ref>

VariantsEdit

Instead of decision trees, linear models have been proposed and evaluated as base estimators in random forests, in particular multinomial logistic regression and naive Bayes classifiers.<ref name=":0">Template:Cite journal</ref><ref>Template:Cite journal</ref><ref>Template:Cite conference</ref> In cases that the relationship between the predictors and the target variable is linear, the base learners may have an equally high accuracy as the ensemble learner.<ref name=":1">Template:Cite journal</ref><ref name=":0" />

Kernel random forestEdit

In machine learning, kernel random forests (KeRF) establish the connection between random forests and kernel methods. By slightly modifying their definition, random forests can be rewritten as kernel methods, which are more interpretable and easier to analyze.<ref name="scornet2015random">Template:Cite arXiv</ref>

HistoryEdit

Leo Breiman<ref name="breiman2000some">Template:Cite journal</ref> was the first person to notice the link between random forest and kernel methods. He pointed out that random forests trained using i.i.d. random vectors in the tree construction are equivalent to a kernel acting on the true margin. Lin and Jeon<ref name="lin2006random">Template:Cite journal</ref> established the connection between random forests and adaptive nearest neighbor, implying that random forests can be seen as adaptive kernel estimates. Davies and Ghahramani<ref name="davies2014random">Template:Cite arXiv</ref> proposed Kernel Random Forest (KeRF) and showed that it can empirically outperform state-of-art kernel methods. Scornet<ref name="scornet2015random"/> first defined KeRF estimates and gave the explicit link between KeRF estimates and random forest. He also gave explicit expressions for kernels based on centered random forest<ref name="breiman2004consistency">Template:Cite journal</ref> and uniform random forest,<ref name="arlot2014analysis">Template:Cite arXiv</ref> two simplified models of random forest. He named these two KeRFs Centered KeRF and Uniform KeRF, and proved upper bounds on their rates of consistency.

Notations and definitionsEdit

Preliminaries: Centered forestsEdit

Centered forest<ref name="breiman2004consistency"/> is a simplified model for Breiman's original random forest, which uniformly selects an attribute among all attributes and performs splits at the center of the cell along the pre-chosen attribute. The algorithm stops when a fully binary tree of level <math>k</math> is built, where <math>k \in\mathbb{N} </math> is a parameter of the algorithm.

Uniform forestEdit

Uniform forest<ref name="arlot2014analysis"/> is another simplified model for Breiman's original random forest, which uniformly selects a feature among all features and performs splits at a point uniformly drawn on the side of the cell, along the preselected feature.

From random forest to KeRFEdit

Given a training sample <math>\mathcal{D}_n =\{(\mathbf{X}_i, Y_i)\}_{i=1}^n</math> of <math>[0,1]^p\times\mathbb{R}</math>-valued independent random variables distributed as the independent prototype pair <math>(\mathbf{X}, Y)</math>, where <math>\operatorname{E}[Y^2]<\infty</math>. We aim at predicting the response <math>Y</math>, associated with the random variable <math>\mathbf{X}</math>, by estimating the regression function <math>m(\mathbf{x})=\operatorname{E}[Y \mid \mathbf{X} = \mathbf{x}]</math>. A random regression forest is an ensemble of <math>M</math> randomized regression trees. Denote <math>m_n(\mathbf{x},\mathbf{\Theta}_j)</math> the predicted value at point <math>\mathbf{x}</math> by the <math>j</math>-th tree, where <math>\mathbf{\Theta}_1,\ldots,\mathbf{\Theta}_M </math> are independent random variables, distributed as a generic random variable <math>\mathbf{\Theta}</math>, independent of the sample <math>\mathcal{D}_n</math>. This random variable can be used to describe the randomness induced by node splitting and the sampling procedure for tree construction. The trees are combined to form the finite forest estimate <math>m_{M, n}(\mathbf{x},\Theta_1,\ldots,\Theta_M) = \frac{1}{M}\sum_{j=1}^M m_n(\mathbf{x},\Theta_j)</math>. For regression trees, we have <math>m_n = \sum_{i=1}^n\frac{Y_i\mathbf{1}_{\mathbf{X}_i\in A_n(\mathbf{x},\Theta_j)}}{N_n(\mathbf{x}, \Theta_j)}</math>, where <math>A_n(\mathbf{x},\Theta_j)</math> is the cell containing <math>\mathbf{x}</math>, designed with randomness <math>\Theta_j</math> and dataset <math>\mathcal{D}_n</math>, and <math> N_n(\mathbf{x}, \Theta_j) = \sum_{i=1}^n \mathbf{1}_{\mathbf{X}_i\in A_n(\mathbf{x}, \Theta_j)}</math>.

Thus random forest estimates satisfy, for all <math>\mathbf{x}\in[0,1]^d</math>, <math> m_{M,n}(\mathbf{x}, \Theta_1,\ldots,\Theta_M) =\frac{1}{M}\sum_{j=1}^M \left(\sum_{i=1}^n\frac{Y_i\mathbf{1}_{\mathbf{X}_i\in A_n(\mathbf{x},\Theta_j)}}{N_n(\mathbf{x}, \Theta_j)}\right)</math>. Random regression forest has two levels of averaging, first over the samples in the target cell of a tree, then over all trees. Thus the contributions of observations that are in cells with a high density of data points are smaller than that of observations which belong to less populated cells. In order to improve the random forest methods and compensate the misestimation, Scornet<ref name="scornet2015random"/> defined KeRF by <math display="block"> \tilde{m}_{M,n}(\mathbf{x}, \Theta_1,\ldots,\Theta_M) = \frac{1}{\sum_{j=1}^M N_n(\mathbf{x}, \Theta_j)}\sum_{j=1}^M\sum_{i=1}^n Y_i\mathbf{1}_{\mathbf{X}_i\in A_n(\mathbf{x}, \Theta_j)},</math> which is equal to the mean of the <math>Y_i</math>'s falling in the cells containing <math>\mathbf{x}</math> in the forest. If we define the connection function of the <math>M</math> finite forest as <math>K_{M,n}(\mathbf{x}, \mathbf{z}) = \frac{1}{M} \sum_{j=1}^M \mathbf{1}_{\mathbf{z} \in A_n (\mathbf{x}, \Theta_j)}</math>, i.e. the proportion of cells shared between <math>\mathbf{x}</math> and <math>\mathbf{z}</math>, then almost surely we have <math>\tilde{m}_{M,n}(\mathbf{x}, \Theta_1,\ldots,\Theta_M) = \frac{\sum_{i=1}^n Y_i K_{M,n}(\mathbf{x}, \mathbf{x}_i)}{\sum_{\ell=1}^n K_{M,n}(\mathbf{x}, \mathbf{x}_{\ell})}</math>, which defines the KeRF.

Centered KeRFEdit

The construction of Centered KeRF of level <math>k</math> is the same as for centered forest, except that predictions are made by <math>\tilde{m}_{M,n}(\mathbf{x}, \Theta_1,\ldots,\Theta_M) </math>, the corresponding kernel function, or connection function is <math display="block"> K_k^{cc}(\mathbf{x},\mathbf{z}) = \sum_{k_1,\ldots,k_d, \sum_{j=1}^d k_j=k} \frac{k!}{k_1!\cdots k_d!} \left(\frac 1 d \right)^k \prod_{j=1}^d\mathbf{1}_{\lceil2^{k_j}x_j\rceil=\lceil2^{k_j}z_j\rceil}, \qquad \text{ for all } \mathbf{x},\mathbf{z}\in[0,1]^d. </math>

Uniform KeRFEdit

Uniform KeRF is built in the same way as uniform forest, except that predictions are made by <math>\tilde{m}_{M,n}(\mathbf{x}, \Theta_1,\ldots,\Theta_M) </math>, the corresponding kernel function, or connection function is <math display="block">K_k^{uf}(\mathbf{0},\mathbf{x}) = \sum_{k_1,\ldots,k_d, \sum_{j=1}^d k_j=k} \frac{k!}{k_1!\ldots k_d!}\left(\frac{1}{d}\right)^k \prod_{m=1}^d\left(1-|x_m|\sum_{j=0}^{k_m-1}\frac{\left(-\ln|x_m|\right)^j}{j!}\right) \text{ for all } \mathbf{x}\in[0,1]^d.</math>

PropertiesEdit

Relation between KeRF and random forestEdit

Predictions given by KeRF and random forests are close if the number of points in each cell is controlled:

Assume that there exist sequences <math> (a_n),(b_n) </math> such that, almost surely, <math display="block"> a_n\leq N_n(\mathbf{x},\Theta)\leq b_n \text{ and } a_n\leq \frac 1 M \sum_{m=1}^M N_n {\mathbf{x},\Theta_m}\leq b_n. </math> Then almost surely, <math display="block">|m_{M,n}(\mathbf{x}) - \tilde{m}_{M,n}(\mathbf{x})| \le\frac{b_n-a_n}{a_n} \tilde{m}_{M,n}(\mathbf{x}). </math>

Relation between infinite KeRF and infinite random forestEdit

When the number of trees <math>M</math> goes to infinity, then we have infinite random forest and infinite KeRF. Their estimates are close if the number of observations in each cell is bounded:

Assume that there exist sequences <math>(\varepsilon_n), (a_n),(b_n)</math> such that, almost surely

<math>\operatorname{E}[N_n(\mathbf{x},\Theta)] \ge 1,</math>

<math>\operatorname{P}[a_n\le N_n(\mathbf{x},\Theta) \le b_n\mid \mathcal{D}_n] \ge 1-\varepsilon_n/2,</math>

<math>\operatorname{P}[a_n\le \operatorname{E}_\Theta [N_n(\mathbf{x},\Theta)] \le b_n\mid \mathcal{D}_n] \ge 1-\varepsilon_n/2,</math>

Then almost surely, <math display="block"> |m_{\infty,n}(\mathbf{x})-\tilde{m}_{\infty,n}(\mathbf{x})| \le \frac{b_n-a_n}{a_n}\tilde{m}_{\infty,n}(\mathbf{x}) + n \varepsilon_n \left( \max_{1\le i\le n} Y_i \right).</math>

Consistency resultsEdit

Assume that <math>Y = m(\mathbf{X}) + \varepsilon</math>, where <math>\varepsilon</math> is a centered Gaussian noise, independent of <math>\mathbf{X}</math>, with finite variance <math>\sigma^2<\infty</math>. Moreover, <math>\mathbf{X}</math> is uniformly distributed on <math>[0,1]^d</math> and <math>m</math> is Lipschitz. Scornet<ref name="scornet2015random"/> proved upper bounds on the rates of consistency for centered KeRF and uniform KeRF.

Consistency of centered KeRFEdit

Providing <math>k\rightarrow\infty</math> and <math>n/2^k\rightarrow\infty</math>, there exists a constant <math>C_1>0</math> such that, for all <math>n</math>, <math> \mathbb{E}[\tilde{m}_n^{cc}(\mathbf{X}) - m(\mathbf{X})]^2 \le C_1 n^{-1/(3+d\log 2)}(\log n)^2</math>.

Consistency of uniform KeRFEdit

Providing <math>k\rightarrow\infty</math> and <math>n/2^k\rightarrow\infty</math>, there exists a constant <math>C>0</math> such that, <math>\mathbb{E}[\tilde{m}_n^{uf}(\mathbf{X})-m(\mathbf{X})]^2\le Cn^{-2/(6+3d\log2)}(\log n)^2</math>.

DisadvantagesEdit

While random forests often achieve higher accuracy than a single decision tree, they sacrifice the intrinsic interpretability of decision trees. Decision trees are among a fairly small family of machine learning models that are easily interpretable along with linear models, rule-based models, and attention-based models. This interpretability is one of the main advantages of decision trees. It allows developers to confirm that the model has learned realistic information from the data and allows end-users to have trust and confidence in the decisions made by the model.<ref name=":0" /><ref name="elemstatlearn" /> For example, following the path that a decision tree takes to make its decision is quite trivial, but following the paths of tens or hundreds of trees is much harder. To achieve both performance and interpretability, some model compression techniques allow transforming a random forest into a minimal "born-again" decision tree that faithfully reproduces the same decision function.<ref name=":0" /><ref>Template:Cite journal</ref><ref>Template:Cite journal</ref>

Another limitation of random forests is that if features are linearly correlated with the target, random forest may not enhance the accuracy of the base learner.<ref name=":0" /><ref name=":1" /> Likewise in problems with multiple categorical variables.<ref name=":3">Template:Cite thesis</ref>

ReferencesEdit

Template:Reflist

External linksEdit

Random Forests classifier description (Leo Breiman's site)
Liaw, Andy & Wiener, Matthew "Classification and Regression by randomForest" R News (2002) Vol. 2/3 p. 18 (Discussion of the use of the random forest package for R)

Random forest

Contents

HistoryEdit

AlgorithmEdit

Preliminaries: decision tree learningEdit

BaggingEdit

From bagging to random forestsEdit

ExtraTreesEdit

Random forests for high-dimensional dataEdit

PropertiesEdit

Variable importanceEdit

Permutation importanceEdit

Mean decrease in impurity feature importanceEdit

Relationship to nearest neighborsEdit

Unsupervised learningEdit

VariantsEdit

Kernel random forestEdit

HistoryEdit

Notations and definitionsEdit

Preliminaries: Centered forestsEdit

Uniform forestEdit

From random forest to KeRFEdit

Centered KeRFEdit

Uniform KeRFEdit

PropertiesEdit

Relation between KeRF and random forestEdit

Relation between infinite KeRF and infinite random forestEdit

Consistency resultsEdit

Consistency of centered KeRFEdit

Consistency of uniform KeRFEdit

DisadvantagesEdit

See alsoEdit

ReferencesEdit

Further readingEdit

External linksEdit