==Uses==

===Advantages===
Among other data mining methods, decision trees have various advantages:
* '''Simple to understand and interpret.''' People are able to understand decision tree models after a brief explanation. Trees can also be displayed graphically in a way that is easy for non-experts to interpret.<ref name=":0">{{Cite book|title=An Introduction to Statistical Learning|url=https://archive.org/details/introductiontost00jame|url-access=limited|last1=James|first1=Gareth|last2=Witten|first2=Daniela|last3=Hastie|first3=Trevor|last4=Tibshirani|first4=Robert|publisher=Springer|year=2015|isbn=978-1-4614-7137-0|location=New York|pages=[https://archive.org/details/introductiontost00jame/page/n323 315]}}</ref>
* '''Able to handle both numerical and [[Categorical variable|categorical]] data.'''<ref name=":0" /> Other techniques are usually specialized in analyzing datasets that have only one type of variable. (For example, relation rules can be used only with nominal variables, while neural networks can be used only with numerical variables or with categorical variables converted to 0–1 values.) Early decision trees were only capable of handling categorical variables, but more recent versions, such as C4.5, do not have this limitation.<ref name="tdidt" />
* '''Requires little data preparation.''' Other techniques often require data normalization. Since trees can handle qualitative predictors, there is no need to create [[dummy variable (statistics)|dummy variables]].<ref name=":0" />
* '''Uses a [[white box (software engineering)|white box]] or open-box<ref name="tdidt" /> model.''' If a given situation is observable in a model, the explanation for the condition is easily given in terms of [[Boolean logic]]. By contrast, in a [[black box]] model such as an [[artificial neural network]], the explanation for the results is typically difficult to understand.
* '''Possible to validate a model using statistical tests.''' That makes it possible to account for the reliability of the model.
* '''Non-parametric.''' Makes no assumptions about the training data or prediction residuals; e.g., no distributional, independence, or constant-variance assumptions.
* '''Performs well with large datasets.''' Large amounts of data can be analyzed using standard computing resources in reasonable time.
* '''Accuracy with flexible modeling.''' These methods may be applied to healthcare research with increased accuracy.<ref>{{Cite journal |last1=Hu |first1=Liangyuan |last2=Li |first2=Lihua |date=2022-12-01 |title=Using Tree-Based Machine Learning for Health Studies: Literature Review and Case Series |journal=International Journal of Environmental Research and Public Health |language=en |volume=19 |issue=23 |pages=16080 |doi=10.3390/ijerph192316080 |issn=1660-4601 |pmc=9736500 |pmid=36498153|doi-access=free }}</ref>
* '''Mirrors human decision making more closely than other approaches.'''<ref name=":0" /> This could be useful when modeling human decisions/behavior.
* '''Robust against collinearity''', particularly when boosting is used.
* '''Built-in [[feature selection]].''' Irrelevant features are used less, so they can be removed on subsequent runs. The hierarchy of attributes in a decision tree reflects the importance of the attributes:<ref>{{Cite book|last=Provost|first=Foster|title=Data science for business: what you need to know about data mining and data-analytic thinking|date=2013|publisher=O'Reilly|others=Fawcett, Tom.|isbn=978-1-4493-6132-7|edition=1st|location=Sebastopol, Calif.|oclc=844460899}}</ref> the features near the top of the tree are the most informative.<ref>{{Cite journal|last1=Piryonesi S. Madeh|last2=El-Diraby Tamer E.|date=2020-06-01|title=Role of Data Analytics in Infrastructure Asset Management: Overcoming Data Size and Quality Problems|journal=Journal of Transportation Engineering, Part B: Pavements|volume=146|issue=2|pages=04020022|doi=10.1061/JPEODX.0000175|s2cid=216485629}}</ref>
* '''Decision trees can approximate any [[Boolean function]]''', e.g. [[Exclusive or|XOR]], as illustrated by the example following this list.<ref>{{cite journal |first1=Dinesh |last1=Mehta |first2=Vijay |last2=Raghavan |title=Decision tree approximations of Boolean functions |journal=Theoretical Computer Science |volume=270 |issue=1–2 |year=2002 |pages=609–623 |doi=10.1016/S0304-3975(01)00011-1 |doi-access=free }}</ref>
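As a minimal sketch of the XOR point above (using the scikit-learn implementation listed under Implementations below; the four-row truth table and the default parameters are illustrative choices, not prescribed by the literature):

<syntaxhighlight lang="python">
from sklearn.tree import DecisionTreeClassifier

# XOR truth table: not linearly separable, yet representable by a depth-2 tree.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

tree = DecisionTreeClassifier().fit(X, y)
print(tree.predict(X))   # [0 1 1 0] -- the tree reproduces XOR exactly
print(tree.get_depth())  # 2: one test on each input variable along every path
</syntaxhighlight>

Each root-to-leaf path tests both input variables, which is precisely the structure a single linear decision boundary cannot express.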
===Limitations===
* Trees can be very non-robust. A small change in the [[Training, test, and validation sets|training data]] can result in a large change in the tree and consequently the final predictions.<ref name=":0" />
* The problem of learning an optimal decision tree is known to be [[NP-complete]] under several aspects of optimality and even for simple concepts.<ref>{{Cite journal | doi = 10.1016/0020-0190(76)90095-8 | last1 = Hyafil | first1 = Laurent | last2 = Rivest | first2 = RL | year = 1976 | title = Constructing Optimal Binary Decision Trees is NP-complete | journal = Information Processing Letters | volume = 5 | issue = 1 | pages = 15–17 }}</ref><ref>Murthy S. (1998). [https://cs.nyu.edu/~roweis/csc2515-2006/readings/murthy_dt.pdf "Automatic construction of decision trees from data: A multidisciplinary survey"]. ''Data Mining and Knowledge Discovery''</ref> Consequently, practical decision-tree learning algorithms are based on heuristics such as the [[greedy algorithm]], where locally optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree. To reduce the greedy effect of local optimality, some methods such as the dual information distance (DID) tree were proposed.<ref>{{cite journal|url=http://www.eng.tau.ac.il/~bengal/DID.pdf|title=Efficient Construction of Decision Trees by the Dual Information Distance Method|author=Ben-Gal I., Dana A., Shkolnik N. and Singer|journal=Quality Technology & Quantitative Management|volume=11|issue=1|pages=133–147|year=2014|doi=10.1080/16843703.2014.11673330|s2cid=7025979|access-date=2014-02-13|archive-date=2016-06-04|archive-url=https://web.archive.org/web/20160604183738/http://www.eng.tau.ac.il/~bengal/DID.pdf|url-status=dead}}</ref>
* Decision-tree learners can create over-complex trees that do not generalize well from the training data. (This is known as [[overfitting]].<ref>{{Cite book | title = Principles of Data Mining | doi = 10.1007/978-1-84628-766-4 | year = 2007 | isbn = 978-1-84628-765-7 | s2cid = 45746 }}</ref>) Mechanisms such as [[Pruning (decision trees)|pruning]] are necessary to avoid this problem (with the exception of some algorithms, such as the Conditional Inference approach, which does not require pruning).<ref name="Hothorn2006" /><ref name="Strobl2009" />
* The average depth of the tree, defined by the number of nodes or tests until classification, is not guaranteed to be minimal or small under various splitting criteria.<ref name="Tris">{{cite web|author = Ben-Gal I. and Trister C. (2015)|title = Parallel Construction of Decision Trees with Consistently Non Increasing Expected Number of Tests|url = http://www.eng.tau.ac.il/~bengal/Trist.pdf|publisher = Applied Stochastic Models in Business and Industry, Vol. 31(1) 64-78|access-date = 2021-01-30|archive-date = 2021-02-05|archive-url = https://web.archive.org/web/20210205043215/http://www.eng.tau.ac.il/~bengal/Trist.pdf|url-status = dead}}</ref>
* For data including categorical variables with different numbers of levels, [[information gain in decision trees]] is biased in favor of attributes with more levels.<ref>{{cite conference|author=Deng, H.|author2=Runger, G. |author3=Tuv, E. |title=Bias of importance measures for multi-valued attributes and solutions|conference=Proceedings of the 21st International Conference on Artificial Neural Networks (ICANN)|year=2011|pages=293–300|url=https://www.researchgate.net/publication/221079908}}</ref> To counter this problem, instead of choosing the attribute with the highest [[information gain]], one can choose the attribute with the highest [[information gain ratio]] among the attributes whose information gain is greater than the mean information gain (see the sketch following this list).<ref>{{cite journal |doi=10.1007/BF00116251 |last=Quinlan |first=J. Ross |title=Induction of Decision Trees |journal=[[Machine Learning (journal)|Machine Learning]] |volume=1 |issue=1 |year=1986 |pages=81–106 |doi-access=free }}</ref> This biases the decision tree against considering attributes with a large number of distinct values, while not giving an unfair advantage to attributes with very low information gain. Alternatively, the issue of biased predictor selection can be avoided by the Conditional Inference approach,<ref name="Hothorn2006" /> a two-stage approach,<ref>{{Cite journal|last1=Brandmaier|first1=Andreas M.|last2=Oertzen|first2=Timo von|last3=McArdle|first3=John J.|last4=Lindenberger|first4=Ulman|title=Structural equation model trees.|journal=Psychological Methods|language=en|volume=18|issue=1|pages=71–86|doi=10.1037/a0030001|pmid=22984789|pmc=4386908|year=2012|hdl=11858/00-001M-0000-0024-EA33-9}}</ref> or adaptive leave-one-out feature selection.<ref>{{cite journal|last1=Painsky|first1=Amichai|last2=Rosset|first2=Saharon|title=Cross-Validated Variable Selection in Tree-Based Methods Improves Predictive Performance|journal=IEEE Transactions on Pattern Analysis and Machine Intelligence|date=2017|volume=39|issue=11|pages=2142–2153|pmid=28114007|doi=10.1109/TPAMI.2016.2636831|arxiv=1512.03444|s2cid=5381516}}</ref>
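The gain-ratio heuristic described in the last item above can be sketched as follows, assuming categorical attributes stored as the columns of a NumPy array; the function names are illustrative rather than taken from any particular library:

<syntaxhighlight lang="python">
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of discrete labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(x, y):
    """Reduction in label entropy obtained by splitting on attribute x."""
    gain = entropy(y)
    for value in np.unique(x):
        subset = y[x == value]
        gain -= len(subset) / len(y) * entropy(subset)
    return gain

def gain_ratio(x, y):
    """Information gain normalized by the attribute's own split information."""
    split_info = entropy(x)
    return information_gain(x, y) / split_info if split_info > 0 else 0.0

def choose_attribute(X, y):
    """Pick the attribute (column index) with the highest gain ratio among
    those whose information gain is at least the mean gain over all attributes."""
    X, y = np.asarray(X), np.asarray(y)
    gains = np.array([information_gain(X[:, j], y) for j in range(X.shape[1])])
    candidates = np.flatnonzero(gains >= gains.mean())
    return max(candidates, key=lambda j: gain_ratio(X[:, j], y))
</syntaxhighlight>

Here <code>choose_attribute</code> returns a column index that a greedy tree builder could use as the next split; restricting the gain-ratio comparison to attributes with at least average information gain is what keeps low-gain attributes from winning merely because their split information is also tiny.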
===Implementations===
Many data mining software packages provide implementations of one or more decision tree algorithms (e.g. random forest).

Open source examples include:
* [[ALGLIB]], a C++, C# and Java numerical analysis library with data analysis features (random forest)
* [[KNIME]], a free and open-source data analytics, reporting and integration platform (decision trees, random forest)
* [[Orange (software)|Orange]], an open-source data visualization, machine learning and data mining toolkit (random forest)
* [[R (programming language)|R]], an open-source software environment for statistical computing, which includes several CART implementations such as the rpart, party and randomForest packages
* [[scikit-learn]], a free and open-source machine learning library for the [[Python (programming language)|Python]] programming language (see the example below)
* [[Weka (machine learning)|Weka]], a free and open-source data-mining suite that contains many decision tree algorithms

Notable commercial software:
* [[MATLAB]]
* [[Microsoft SQL Server]]
* [[RapidMiner]]
* [[SAS (software)#Components|SAS Enterprise Miner]]
* [[SPSS Modeler|IBM SPSS Modeler]]
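As a brief, non-authoritative example of the scikit-learn implementation listed above (the bundled Iris data set, the entropy criterion, and the pruning parameter are arbitrary illustrative choices):

<syntaxhighlight lang="python">
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X, y = iris.data, iris.target

# ccp_alpha enables cost-complexity pruning, one way of limiting the
# overfitting discussed under Limitations.
clf = DecisionTreeClassifier(criterion="entropy", ccp_alpha=0.01).fit(X, y)

# The fitted tree prints as human-readable if/else rules, reflecting the
# white-box interpretability discussed under Advantages.
print(export_text(clf, feature_names=iris.feature_names))
</syntaxhighlight>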