Random forest
=== Notations and definitions ===

==== Preliminaries: Centered forests ====
Centered forest<ref name="breiman2004consistency"/> is a simplified model of Breiman's original random forest. It selects an attribute uniformly at random among all attributes and splits the cell at its center along the chosen attribute. The algorithm stops when a full binary tree of level <math>k</math> is built, where <math>k \in\mathbb{N} </math> is a parameter of the algorithm.

==== Uniform forest ====
Uniform forest<ref name="arlot2014analysis"/> is another simplified model of Breiman's original random forest. It selects a feature uniformly at random among all features and splits the cell at a point drawn uniformly on the side of the cell, along the preselected feature.

==== From random forest to KeRF ====
Consider a training sample <math>\mathcal{D}_n =\{(\mathbf{X}_i, Y_i)\}_{i=1}^n</math> of <math>[0,1]^d\times\mathbb{R}</math>-valued independent random variables distributed as the independent prototype pair <math>(\mathbf{X}, Y)</math>, where <math>\operatorname{E}[Y^2]<\infty</math>. The goal is to predict the response <math>Y</math>, associated with the random variable <math>\mathbf{X}</math>, by estimating the regression function <math>m(\mathbf{x})=\operatorname{E}[Y \mid \mathbf{X} = \mathbf{x}]</math>. A random regression forest is an ensemble of <math>M</math> randomized regression trees. Denote by <math>m_n(\mathbf{x},\mathbf{\Theta}_j)</math> the value predicted at point <math>\mathbf{x}</math> by the <math>j</math>-th tree, where <math>\mathbf{\Theta}_1,\ldots,\mathbf{\Theta}_M </math> are independent random variables, distributed as a generic random variable <math>\mathbf{\Theta}</math>, independent of the sample <math>\mathcal{D}_n</math>. This random variable describes the randomness induced by node splitting and by the sampling procedure used for tree construction.
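
As a minimal illustration of how <math>\mathbf{\Theta}</math> can encode the splitting randomness, the following Python sketch locates the cell of a centered tree of level <math>k</math> that contains a query point. It assumes that, for the purpose of locating a single cell, <math>\mathbf{\Theta}</math> can be reduced to the sequence of <math>k</math> coordinates drawn along the path to that point, and that cells are half-open on the left. The names <code>draw_path</code>, <code>centered_cell</code> and <code>in_cell</code> are illustrative and do not come from the cited references.

```python
import random


def draw_path(k, d, rng):
    """Draw k split coordinates, each uniform over the d features."""
    return [rng.randrange(d) for _ in range(k)]


def centered_cell(x, path):
    """Return bounds (lo, hi) of the cell containing x after center splits.

    Each entry of `path` is the coordinate split at that level. Cells are
    treated as intervals (lo, hi] in every coordinate, which is an assumed
    tie-breaking convention for points lying exactly on a split.
    """
    d = len(x)
    lo, hi = [0.0] * d, [1.0] * d
    for j in path:
        mid = (lo[j] + hi[j]) / 2.0  # split at the center of the current cell
        if x[j] <= mid:
            hi[j] = mid
        else:
            lo[j] = mid
    return lo, hi


def in_cell(z, lo, hi):
    """True if z lies in the cell (lo, hi] coordinate by coordinate."""
    return all(l < zj <= h for zj, l, h in zip(z, lo, hi))


# Example: one random path of level k = 3 in dimension d = 2.
rng = random.Random(0)
path = draw_path(k=3, d=2, rng=rng)
lo, hi = centered_cell([0.3, 0.8], path)
print(path, lo, hi)
```

Whatever path is drawn, the resulting cell has volume <math>2^{-k}</math>, because every split halves one side of the current cell.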
The trees are combined to form the finite forest estimate <math>m_{M, n}(\mathbf{x},\mathbf{\Theta}_1,\ldots,\mathbf{\Theta}_M) = \frac{1}{M}\sum_{j=1}^M m_n(\mathbf{x},\mathbf{\Theta}_j)</math>. For regression trees, <math>m_n(\mathbf{x},\mathbf{\Theta}_j) = \sum_{i=1}^n\frac{Y_i\mathbf{1}_{\mathbf{X}_i\in A_n(\mathbf{x},\mathbf{\Theta}_j)}}{N_n(\mathbf{x}, \mathbf{\Theta}_j)}</math>, where <math>A_n(\mathbf{x},\mathbf{\Theta}_j)</math> is the cell containing <math>\mathbf{x}</math>, designed with randomness <math>\mathbf{\Theta}_j</math> and dataset <math>\mathcal{D}_n</math>, and <math> N_n(\mathbf{x}, \mathbf{\Theta}_j) = \sum_{i=1}^n \mathbf{1}_{\mathbf{X}_i\in A_n(\mathbf{x}, \mathbf{\Theta}_j)}</math> is the number of training points in that cell. Thus random forest estimates satisfy, for all <math>\mathbf{x}\in[0,1]^d</math>, <math> m_{M,n}(\mathbf{x}, \mathbf{\Theta}_1,\ldots,\mathbf{\Theta}_M) =\frac{1}{M}\sum_{j=1}^M \left(\sum_{i=1}^n\frac{Y_i\mathbf{1}_{\mathbf{X}_i\in A_n(\mathbf{x},\mathbf{\Theta}_j)}}{N_n(\mathbf{x}, \mathbf{\Theta}_j)}\right)</math>.

A random regression forest has two levels of averaging: first over the samples in the target cell of a tree, then over all trees. Thus the contributions of observations that are in cells with a high density of data points are smaller than those of observations in less populated cells. To improve the random forest method and compensate for this misestimation, Scornet<ref name="scornet2015random"/> defined KeRF by
<math display="block"> \tilde{m}_{M,n}(\mathbf{x}, \mathbf{\Theta}_1,\ldots,\mathbf{\Theta}_M) = \frac{1}{\sum_{j=1}^M N_n(\mathbf{x}, \mathbf{\Theta}_j)}\sum_{j=1}^M\sum_{i=1}^n Y_i\mathbf{1}_{\mathbf{X}_i\in A_n(\mathbf{x}, \mathbf{\Theta}_j)},</math>
which is equal to the mean of the <math>Y_i</math>'s falling in the cells containing <math>\mathbf{x}</math> in the forest. If the connection function of the finite forest with <math>M</math> trees is defined as <math>K_{M,n}(\mathbf{x}, \mathbf{z}) = \frac{1}{M} \sum_{j=1}^M \mathbf{1}_{\mathbf{z} \in A_n (\mathbf{x}, \mathbf{\Theta}_j)}</math>, i.e. the proportion of cells shared between <math>\mathbf{x}</math> and <math>\mathbf{z}</math>, then almost surely <math>\tilde{m}_{M,n}(\mathbf{x}, \mathbf{\Theta}_1,\ldots,\mathbf{\Theta}_M) = \frac{\sum_{i=1}^n Y_i K_{M,n}(\mathbf{x}, \mathbf{X}_i)}{\sum_{\ell=1}^n K_{M,n}(\mathbf{x}, \mathbf{X}_{\ell})}</math>, which defines the KeRF.

==== Centered KeRF ====
Centered KeRF of level <math>k</math> is constructed in the same way as centered forest, except that predictions are made by <math>\tilde{m}_{M,n}(\mathbf{x}, \mathbf{\Theta}_1,\ldots,\mathbf{\Theta}_M) </math>. The corresponding kernel function, or connection function, is
<math display="block"> K_k^{cc}(\mathbf{x},\mathbf{z}) = \sum_{k_1,\ldots,k_d, \sum_{j=1}^d k_j=k} \frac{k!}{k_1!\cdots k_d!} \left(\frac 1 d \right)^k \prod_{j=1}^d\mathbf{1}_{\lceil2^{k_j}x_j\rceil=\lceil2^{k_j}z_j\rceil}, \qquad \text{ for all } \mathbf{x},\mathbf{z}\in[0,1]^d. </math>

==== Uniform KeRF ====
Uniform KeRF is built in the same way as uniform forest, except that predictions are made by <math>\tilde{m}_{M,n}(\mathbf{x}, \mathbf{\Theta}_1,\ldots,\mathbf{\Theta}_M) </math>. The corresponding kernel function, or connection function, is
<math display="block">K_k^{uf}(\mathbf{0},\mathbf{x}) = \sum_{k_1,\ldots,k_d, \sum_{j=1}^d k_j=k} \frac{k!}{k_1!\cdots k_d!}\left(\frac{1}{d}\right)^k \prod_{m=1}^d\left(1-|x_m|\sum_{j=0}^{k_m-1}\frac{\left(-\ln|x_m|\right)^j}{j!}\right) \qquad \text{ for all } \mathbf{x}\in[0,1]^d.</math>
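
The centered kernel <math>K_k^{cc}</math> and the kernel form of the KeRF estimate given above can be evaluated directly for small <math>k</math> and <math>d</math>. The following Python sketch enumerates all <math>(k_1,\ldots,k_d)</math> summing to <math>k</math> and then forms the weighted average of the responses. The function names <code>centered_kernel</code> and <code>kerf_predict</code> are illustrative, the returned value 0 when no training point is connected to the query is an assumed convention, and points are assumed to have coordinates in <math>(0,1]</math> so that the ceiling indices are well defined.

```python
import math
from itertools import product


def centered_kernel(x, z, k):
    """Evaluate the centered kernel K_k^{cc}(x, z) by direct enumeration."""
    d = len(x)
    total = 0.0
    for ks in product(range(k + 1), repeat=d):
        if sum(ks) != k:
            continue
        # Multinomial coefficient k! / (k_1! ... k_d!).
        coef = math.factorial(k)
        for kj in ks:
            coef //= math.factorial(kj)
        # x and z are connected when they share the same dyadic cell of
        # depth k_j along every coordinate j.
        connected = all(
            math.ceil(2 ** kj * xj) == math.ceil(2 ** kj * zj)
            for kj, xj, zj in zip(ks, x, z)
        )
        if connected:
            total += coef * (1.0 / d) ** k
    return total


def kerf_predict(x, X, Y, k):
    """Kernel-weighted mean of Y using the centered kernel as weights."""
    weights = [centered_kernel(x, xi, k) for xi in X]
    denom = sum(weights)
    if denom == 0.0:
        return 0.0  # assumed convention when no training point is connected
    return sum(w * y for w, y in zip(weights, Y)) / denom


# Toy data (made up for illustration only).
X = [(0.2, 0.2), (0.7, 0.3), (0.9, 0.9)]
Y = [1.0, 3.0, 10.0]
print(kerf_predict((0.2, 0.2), X, Y, k=1))
```

The enumeration visits <math>(k+1)^d</math> tuples, so this direct evaluation is only practical for small <math>k</math> and <math>d</math>.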