
==== Mean decrease in impurity feature importance ====
This approach to feature importance for random forests treats as important the variables that produce large decreases in impurity when they are used for splitting.<ref>{{Cite book |last=Ortiz-Posadas |first=Martha Refugio |url=https://books.google.com/books?id=d6LTDwAAQBAJ&dq=Mean+Decrease+in+Impurity+Feature+Importance&pg=PA116 |title=Pattern Recognition Techniques Applied to Biomedical Problems |date=2020-02-29 |publisher=Springer Nature |isbn=978-3-030-38021-2 |language=en}}</ref> It is described in the book ''Classification and Regression Trees'' by Leo Breiman<ref>{{Cite book |last=Breiman |first=Leo |url=https://www.taylorfrancis.com/books/mono/10.1201/9781315139470/classification-regression-trees-leo-breiman |title=Classification and Regression Trees |date=2017-10-25 |publisher=Routledge |isbn=978-1-315-13947-0 |location=New York |doi=10.1201/9781315139470}}</ref> and is the default implementation in [[Scikit-learn|<code>scikit-learn</code>]] and [[R (programming language)|R]]. The definition is:<math display="block">\text{unnormalized average importance}(x)=\frac{1}{n_T} \sum_{i=1}^{n_T} \sum_{\text{node }j \in T_i | \text{split variable}(j) = x} p_{T_i}(j)\Delta i_{T_i}(j),</math>where 

* <math>x</math> is a feature
* <math>n_T</math> is the number of trees in the forest
* <math>T_i</math> is tree <math>i</math>
* <math>p_{T_i}(j)=\frac{n_j}{n}</math> is the fraction of samples reaching node <math>j</math>
* <math>\Delta i_{T_i}(j)</math> is the change in impurity in tree <math>T_i</math> at node <math>j</math>.
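
The definition can be illustrated with a minimal Python sketch. The forest below is hypothetical: each tree is reduced to a list of its split nodes, the feature names and all numbers are made up for illustration, and every tree is assumed to be trained on the same number of samples <math>n</math>.

```python
from collections import defaultdict

# Hypothetical forest: each tree is a list of split nodes, written as
# (split_feature, samples_reaching_node n_j, impurity_decrease delta_i).
# All values are invented for illustration.
N_SAMPLES = 100  # n, assumed identical for every tree

forest = [
    [("age", 100, 0.20), ("income", 60, 0.10), ("age", 40, 0.05)],  # tree T_1
    [("income", 100, 0.25), ("age", 50, 0.08)],                     # tree T_2
]


def unnormalized_mdi(forest, n_samples):
    """Average over trees of the sum of p(j) * delta_i(j) for nodes splitting on each feature."""
    totals = defaultdict(float)
    for tree in forest:
        for feature, n_j, delta_i in tree:
            # p(j) = n_j / n is the fraction of samples reaching node j
            totals[feature] += (n_j / n_samples) * delta_i
    n_trees = len(forest)
    return {feature: total / n_trees for feature, total in totals.items()}


importance = unnormalized_mdi(forest, N_SAMPLES)
```

In this toy forest, a split near the root (reached by all samples) contributes more than an equally pure split deeper in the tree, because of the weight <math>p_{T_i}(j)</math>.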

The impurity of the samples falling in a node can be measured, for example, with one of the following statistics:
*[[Entropy (information theory)|Entropy]]
*[[Decision tree learning#Gini impurity|Gini impurity]]
*[[Mean squared error]]
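
The three measures can be sketched in plain Python as follows. The function names are illustrative, and the impurity change of a split is assumed here to be the parent's impurity minus the sample-weighted impurity of its two children, which is the usual convention but is not spelled out above.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of the class labels in a node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of the class labels in a node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def mse(values):
    """Mean squared error around the node mean (for regression targets)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def impurity_decrease(parent, left, right, impurity):
    """Parent impurity minus the sample-weighted impurity of the two children (assumed convention)."""
    n = len(parent)
    children = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - children


# A split that separates the two classes perfectly removes all impurity.
perfect_split_gain = impurity_decrease([0, 0, 1, 1], [0, 0], [1, 1], gini)
```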

The normalized importance is then obtained by normalizing over all features, so that the sum of normalized feature importances is 1.
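
A minimal sketch of this normalization step, using hypothetical unnormalized values and an illustrative function name:

```python
def normalize(importances):
    """Rescale importances so that they sum to 1 (all zeros if no split was ever made)."""
    total = sum(importances.values())
    if total == 0:
        return {feature: 0.0 for feature in importances}
    return {feature: value / total for feature, value in importances.items()}


# Hypothetical unnormalized importances; "height" was never used for a split.
raw = {"age": 0.13, "income": 0.155, "height": 0.0}
normalized = normalize(raw)
```

A feature that is never chosen for a split keeps an importance of 0 after normalization.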

The <code>scikit-learn</code> default implementation can report misleading feature importance:<ref name=":2" />
* it favors high-cardinality features
* it uses training statistics and so does not reflect a feature's usefulness for predictions on a test set<ref>https://scikit-learn.org/stable/auto_examples/inspection/plot_permutation_importance.html 31. Aug. 2023</ref>
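
The first point can be illustrated with a toy example in plain Python; it is a sketch, not the <code>scikit-learn</code> implementation. The data are made up: <code>row_id</code> is a hypothetical identifier column that cannot generalise to unseen rows, yet it offers seven candidate thresholds, while the binary feature offers only one.

```python
def gini(labels):
    """Gini impurity of the class labels in a node."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_gini_decrease(feature, labels):
    """Largest impurity decrease over all threshold splits on one feature."""
    n = len(labels)
    parent = gini(labels)
    best = 0.0
    # Every distinct value except the largest is a candidate threshold.
    for threshold in sorted(set(feature))[:-1]:
        left = [y for x, y in zip(feature, labels) if x <= threshold]
        right = [y for x, y in zip(feature, labels) if x > threshold]
        children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
        best = max(best, parent - children)
    return best


# Made-up training data (8 samples).
labels = [0, 0, 1, 0, 1, 1, 0, 1]
binary_feature = [0, 1, 0, 1, 0, 1, 0, 1]  # 2 distinct values, 1 candidate split
row_id = [0, 1, 2, 3, 4, 5, 6, 7]          # 8 distinct values, 7 candidate splits

gain_binary = best_gini_decrease(binary_feature, labels)
gain_row_id = best_gini_decrease(row_id, labels)
```

Here the identifier column achieves a strictly larger impurity decrease than the binary feature on the training data, simply because it has more thresholds to choose from. Since the decrease is measured on training samples only, it also says nothing about usefulness on a test set.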