===Limitations===
* Trees can be very non-robust. A small change in the [[Training, test, and validation sets|training data]] can result in a large change in the tree and, consequently, in the final predictions.<ref name=":0" />
* The problem of learning an optimal decision tree is known to be [[NP-complete]] under several aspects of optimality and even for simple concepts.<ref>{{Cite journal | doi = 10.1016/0020-0190(76)90095-8 | last1 = Hyafil | first1 = Laurent | last2 = Rivest | first2 = RL | year = 1976 | title = Constructing Optimal Binary Decision Trees is NP-complete | journal = Information Processing Letters | volume = 5 | issue = 1 | pages = 15–17 }}</ref><ref>Murthy S. (1998). [https://cs.nyu.edu/~roweis/csc2515-2006/readings/murthy_dt.pdf "Automatic construction of decision trees from data: A multidisciplinary survey"]. ''Data Mining and Knowledge Discovery''</ref> Consequently, practical decision-tree learning algorithms are based on heuristics such as the [[greedy algorithm]], in which locally optimal decisions are made at each node (see the first sketch after this list). Such algorithms cannot guarantee returning the globally optimal decision tree. To reduce the effect of such local optimality, some methods such as the dual information distance (DID) tree have been proposed.<ref>{{cite journal|url=http://www.eng.tau.ac.il/~bengal/DID.pdf|title=Efficient Construction of Decision Trees by the Dual Information Distance Method|author=Ben-Gal I., Dana A., Shkolnik N. and Singer|journal=Quality Technology & Quantitative Management|volume=11|issue=1|pages=133–147|year=2014|doi=10.1080/16843703.2014.11673330|s2cid=7025979|access-date=2014-02-13|archive-date=2016-06-04|archive-url=https://web.archive.org/web/20160604183738/http://www.eng.tau.ac.il/~bengal/DID.pdf|url-status=dead}}</ref>
* Decision-tree learners can create over-complex trees that do not generalize well from the training data. (This is known as [[overfitting]].<ref>{{Cite book | title = Principles of Data Mining | doi = 10.1007/978-1-84628-766-4 | year = 2007 | isbn = 978-1-84628-765-7 | s2cid = 45746 }}</ref>) Mechanisms such as [[Pruning (decision trees)|pruning]] are necessary to avoid this problem (with the exception of some algorithms, such as the Conditional Inference approach, which does not require pruning).<ref name="Hothorn2006" /><ref name="Strobl2009" />
* The average depth of the tree, as defined by the number of nodes or tests until classification, is not guaranteed to be minimal or small under various splitting criteria.<ref name="Tris">{{cite web|author = Ben-Gal I. and Trister C. (2015)|title = Parallel Construction of Decision Trees with Consistently Non Increasing Expected Number of Tests|url = http://www.eng.tau.ac.il/~bengal/Trist.pdf|publisher = Applied Stochastic Models in Business and Industry, Vol. 31(1) 64–78|access-date = 2021-01-30|archive-date = 2021-02-05|archive-url = https://web.archive.org/web/20210205043215/http://www.eng.tau.ac.il/~bengal/Trist.pdf|url-status = dead}}</ref>
* For data including categorical variables with different numbers of levels, [[information gain in decision trees]] is biased in favor of attributes with more levels.<ref>{{cite conference|author=Deng, H.|author2=Runger, G.|author3=Tuv, E.|title=Bias of importance measures for multi-valued attributes and solutions|conference=Proceedings of the 21st International Conference on Artificial Neural Networks (ICANN)|year=2011|pages=293–300|url=https://www.researchgate.net/publication/221079908}}</ref> To counter this problem, instead of choosing the attribute with the highest [[information gain]], one can choose the attribute with the highest [[information gain ratio]] among the attributes whose information gain is greater than the mean information gain (see the second sketch after this list).<ref>{{cite journal |doi=10.1007/BF00116251 |last=Quinlan |first=J. Ross |title=Induction of Decision Trees |journal=[[Machine Learning (journal)|Machine Learning]] |volume=1 |issue=1 |year=1986 |pages=81–106 |doi-access=free }}</ref> This biases the decision tree against considering attributes with a large number of distinct values, while not giving an unfair advantage to attributes with very low information gain. Alternatively, the issue of biased predictor selection can be avoided by the Conditional Inference approach,<ref name="Hothorn2006" /> a two-stage approach,<ref>{{Cite journal|last1=Brandmaier|first1=Andreas M.|last2=Oertzen|first2=Timo von|last3=McArdle|first3=John J.|last4=Lindenberger|first4=Ulman|title=Structural equation model trees.|journal=Psychological Methods|language=en|volume=18|issue=1|pages=71–86|doi=10.1037/a0030001|pmid=22984789|pmc=4386908|year=2012|hdl=11858/00-001M-0000-0024-EA33-9}}</ref> or adaptive leave-one-out feature selection.<ref>{{cite journal|last1=Painsky|first1=Amichai|last2=Rosset|first2=Saharon|title=Cross-Validated Variable Selection in Tree-Based Methods Improves Predictive Performance|journal=IEEE Transactions on Pattern Analysis and Machine Intelligence|date=2017|volume=39|issue=11|pages=2142–2153|pmid=28114007|doi=10.1109/TPAMI.2016.2636831|arxiv=1512.03444|s2cid=5381516}}</ref>
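The following minimal Python sketch illustrates the greedy, top-down induction mentioned above; it is an illustrative example rather than any specific published algorithm, and the attribute names (<code>x1</code>, <code>x2</code>, <code>noise</code>) and toy data are invented. Because the class label is the XOR of <code>x1</code> and <code>x2</code>, neither relevant attribute shows any information gain when scored in isolation at the root, while a weakly correlated <code>noise</code> attribute does, so the purely local criterion places the spurious attribute at the root and grows a deeper tree than the optimal depth-two tree.

<syntaxhighlight lang="python">
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Reduction in entropy obtained by partitioning the rows on one attribute."""
    n = len(labels)
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def grow(rows, labels, attrs):
    """Recursively grow a tree; every split decision is local, with no look-ahead."""
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]              # majority-label leaf
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))  # locally optimal split
    branches = {}
    for value in {r[best] for r in rows}:
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        branches[value] = grow([rows[i] for i in idx],
                               [labels[i] for i in idx],
                               [a for a in attrs if a != best])
    return (best, branches)

# Toy data: two copies of every (x1, x2) combination, with y = x1 XOR x2,
# so neither x1 nor x2 has any information gain on its own at the root.
rows = [{"x1": a, "x2": b} for _ in range(2) for a in (0, 1) for b in (0, 1)]
labels = [r["x1"] ^ r["x2"] for r in rows]
# An invented 'noise' attribute that agrees with y on 7 of the 8 rows,
# giving it the highest immediate gain even though the optimal tree never uses it.
for r, y in zip(rows, labels):
    r["noise"] = y
rows[-1]["noise"] = 1 - labels[-1]

tree = grow(rows, labels, ["x1", "x2", "noise"])
print(tree[0])  # 'noise': a spurious, locally optimal root split; the
                # optimal tree splits only on x1 and x2 and has depth 2.
</syntaxhighlight>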
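The second sketch, again illustrative rather than code from the cited paper, shows the selection rule just described: only attributes whose information gain is at least the mean gain are kept, and among those the attribute with the highest gain ratio is chosen. The attribute names and toy rows are invented; the unique-valued <code>id</code> attribute has the largest raw information gain but is penalized by its large split information, so the rule selects <code>windy</code> instead.

<syntaxhighlight lang="python">
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels (or attribute values)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Reduction in class entropy obtained by partitioning on one attribute."""
    n = len(labels)
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def split_information(rows, attr):
    """Entropy of the partition induced by the attribute's values;
    large for many-valued attributes such as a row identifier."""
    return entropy([r[attr] for r in rows])

def select_attribute(rows, labels, attrs):
    """Highest gain ratio among attributes with at least the mean information gain."""
    gains = {a: information_gain(rows, labels, a) for a in attrs}
    mean_gain = sum(gains.values()) / len(gains)
    candidates = [a for a in attrs if gains[a] >= mean_gain]
    def gain_ratio(a):
        si = split_information(rows, a)
        return gains[a] / si if si > 0 else 0.0
    return max(candidates, key=gain_ratio)

# Invented toy rows: 'id' is unique per row and perfectly separates the
# training labels, so plain information gain would favour it.
rows = [
    {"id": 1, "outlook": "sunny", "windy": "no"},
    {"id": 2, "outlook": "sunny", "windy": "no"},
    {"id": 3, "outlook": "rain",  "windy": "no"},
    {"id": 4, "outlook": "rain",  "windy": "no"},
    {"id": 5, "outlook": "sunny", "windy": "yes"},
    {"id": 6, "outlook": "sunny", "windy": "yes"},
    {"id": 7, "outlook": "rain",  "windy": "yes"},
    {"id": 8, "outlook": "rain",  "windy": "yes"},
]
labels = ["play", "play", "play", "play", "stay", "stay", "stay", "play"]
print(select_attribute(rows, labels, ["id", "outlook", "windy"]))  # windy
</syntaxhighlight>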