== Decision tree types ==

Decision trees used in [[data mining]] are of two main types:
* '''[[Classification tree]]''' analysis is when the predicted outcome is the class (discrete) to which the data belongs.
* '''Regression tree''' analysis is when the predicted outcome can be considered a real number (e.g. the price of a house, or a patient's length of stay in a hospital).

The term '''classification and regression tree (CART)''' analysis is an [[umbrella term]] used to refer to either of the above procedures, first introduced by [[Leo Breiman|Breiman]] et al. in 1984.<ref name="bfos">{{Cite book |last=Breiman |first=Leo |author2=Friedman, J. H. |author3=Olshen, R. A. |author4=Stone, C. J. |title=Classification and regression trees |year=1984 |publisher=Wadsworth & Brooks/Cole Advanced Books & Software |location=Monterey, CA |isbn=978-0-412-04841-8 }}</ref> Trees used for regression and trees used for classification have some similarities – but also some differences, such as the procedure used to determine where to split.<ref name="bfos"/>

Some techniques, often called ''ensemble'' methods, construct more than one decision tree (an illustrative sketch follows the list below):
* '''[[Gradient boosted trees|Boosted trees]]''', which incrementally build an ensemble by training each new tree to emphasize the training instances previously mis-modeled. A typical example is [[AdaBoost]]. These can be used for regression-type and classification-type problems.<ref>Friedman, J. H. (1999). ''[https://astro.temple.edu/~msobel/courses_files/StochasticBoosting(gradient).pdf Stochastic gradient boosting] {{Webarchive|url=https://web.archive.org/web/20181128041212/https://astro.temple.edu/~msobel/courses_files/StochasticBoosting(gradient).pdf |date=2018-11-28 }}.'' Stanford University.</ref><ref>Hastie, T., Tibshirani, R., Friedman, J. H. (2001). ''The elements of statistical learning: Data mining, inference, and prediction.'' New York: Springer Verlag.</ref>
* '''Committees of decision trees''' (also called k-DT<ref>Heath, D., Kasif, S. and Salzberg, S. (1993). ''k-DT: A multi-tree learning method.'' In ''Proceedings of the Second Intl. Workshop on Multistrategy Learning'', pp. 138-149.</ref>), an early method that used randomized decision tree algorithms to generate multiple different trees from the training data, and then combined them using majority voting to generate output.<ref>Heath, D., Kasif, S., and Salzberg, S. L. (1996). ''Committees of decision trees.'' In B. Gorayska and J. Mey (Eds.), ''Cognitive Technology: In Search of a Humane Interface'' (pp. 305–317). Amsterdam: Elsevier Science B.V.</ref>
* '''[[Bootstrap aggregating|Bootstrap aggregated]]''' (or bagged) decision trees, an early ensemble method, builds multiple decision trees by repeatedly [[Bootstrapping (statistics)|resampling training data with replacement]], and voting the trees for a consensus prediction.<ref>{{cite journal |last=Breiman |first=L. |year=1996 |title=Bagging Predictors |journal=Machine Learning |volume=24 |issue=2 |pages=123–140 |doi=10.1007/BF00058655 |doi-access=free }}</ref>
** A '''[[random forest]]''' classifier is a specific type of [[bootstrap aggregating|bootstrap aggregated]] ensemble in which each tree additionally considers only a random subset of the features at each split.
* '''Rotation forest''' – in which every decision tree is trained by first applying [[principal component analysis]] (PCA) on a random subset of the input features.<ref>{{cite journal |last1=Rodriguez |first1=J. J. |last2=Kuncheva |first2=L. I. |author2-link=Ludmila Kuncheva |last3=Alonso |first3=C. J. |year=2006 |title=Rotation forest: A new classifier ensemble method |journal=IEEE Transactions on Pattern Analysis and Machine Intelligence |volume=28 |issue=10 |pages=1619–1630 |doi=10.1109/TPAMI.2006.211 |pmid=16986543 |citeseerx=10.1.1.156.8277 |s2cid=6847493 }}</ref>
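The two tree types and the ensemble methods above can be illustrated with a short sketch. It assumes the third-party [[scikit-learn]] library is available and is not the reference implementation of any of the cited methods:

<syntaxhighlight lang="python">
# Minimal sketch of the tree types above, assuming scikit-learn is installed;
# illustrative only, not the original implementation of the cited algorithms.
from sklearn.datasets import load_iris, make_regression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)

# Classification tree: the predicted outcome is a discrete class.
X_cls, y_cls = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3).fit(X_cls, y_cls)

# Regression tree: the predicted outcome is a real number.
X_reg, y_reg = make_regression(n_samples=200, n_features=5, random_state=0)
reg = DecisionTreeRegressor(max_depth=3).fit(X_reg, y_reg)

# Ensembles of trees:
bagged = BaggingClassifier(n_estimators=50).fit(X_cls, y_cls)        # bootstrap aggregating (bagging)
forest = RandomForestClassifier(n_estimators=100).fit(X_cls, y_cls)  # bagging plus random feature subsets
boosted = AdaBoostClassifier(n_estimators=100).fit(X_cls, y_cls)     # boosting (AdaBoost)

print(clf.predict(X_cls[:3]), forest.predict(X_cls[:3]), boosted.predict(X_cls[:3]))
</syntaxhighlight>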
A special case of a decision tree is a [[decision list]],<ref>{{cite journal|last1=Rivest|first1=Ron|title=Learning Decision Lists|journal=Machine Learning|date=Nov 1987|volume=3|issue=2|pages=229–246|doi=10.1023/A:1022607331053|s2cid=30625841|url=http://people.csail.mit.edu/rivest/pubs/Riv87b.pdf|doi-access=free}}</ref> which is a one-sided decision tree, so that every internal node has exactly 1 leaf node and exactly 1 internal node as a child (except for the bottommost node, whose only child is a single leaf node). While less expressive, decision lists are arguably easier to understand than general decision trees due to their added sparsity{{citation needed|date=December 2021}}; they also permit non-greedy learning methods<ref>{{cite journal|last1=Letham|first1=Ben|last2=Rudin|first2=Cynthia|author2-link=Cynthia Rudin|last3=McCormick|first3=Tyler|last4=Madigan|first4=David|title=Interpretable Classifiers Using Rules And Bayesian Analysis: Building A Better Stroke Prediction Model|journal=Annals of Applied Statistics|date=2015|volume=9|issue=3|pages=1350–1371|doi=10.1214/15-AOAS848|arxiv=1511.01644|s2cid=17699665}}</ref> and monotonic constraints to be imposed.<ref>{{cite journal|last1=Wang|first1=Fulton|last2=Rudin|first2=Cynthia|author2-link=Cynthia Rudin|title=Falling Rule Lists|journal=Journal of Machine Learning Research|date=2015|volume=38|url=http://www.jmlr.org/proceedings/papers/v38/wang15a.pdf|access-date=2016-01-22|archive-date=2016-01-28|archive-url=https://web.archive.org/web/20160128223950/http://www.jmlr.org/proceedings/papers/v38/wang15a.pdf|url-status=dead}}</ref>
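As a toy illustration of this structure (the feature names and thresholds below are hypothetical, not drawn from any of the cited papers), a decision list can be read as a chain of if/else-if rules, each rule either emitting a prediction (its leaf child) or deferring to the next rule (its internal-node child):

<syntaxhighlight lang="python">
# Toy decision list: every rule either returns a prediction (its leaf child)
# or falls through to the next rule (its internal-node child).
# Features and thresholds are hypothetical.
def decision_list(patient):
    if patient["age"] < 30:                # rule 1 -> leaf
        return "low risk"
    elif patient["blood_pressure"] > 140:  # rule 2 -> leaf
        return "high risk"
    else:                                  # bottommost node -> final leaf
        return "medium risk"

print(decision_list({"age": 45, "blood_pressure": 150}))  # -> "high risk"
</syntaxhighlight>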
Notable decision tree algorithms include:
* [[ID3 algorithm|ID3]] (Iterative Dichotomiser 3)
* [[C4.5 algorithm|C4.5]] (successor of ID3)
* CART (Classification And Regression Tree)<ref name="bfos" />
* OC1 (Oblique classifier 1). First method that created multivariate splits at each node.<ref>{{Cite journal | doi = 10.1613/jair.63 | last1 = Murthy | first1 = S. K. | year = 1994 | title = A System for Induction of Oblique Decision Trees | journal = Journal of Artificial Intelligence Research | volume = 2 | issue = 1 | pages = 1–32 | doi-access = free }}</ref>
* [[Chi-square automatic interaction detection]] (CHAID). Performs multi-level splits when computing classification trees.<ref>{{Cite journal | doi = 10.2307/2986296 | last1 = Kass | first1 = G. V. | year = 1980 | title = An exploratory technique for investigating large quantities of categorical data | jstor = 2986296 | journal = Applied Statistics | volume = 29 | issue = 2 | pages = 119–127 }}</ref><ref>{{Cite journal|last1=Biggs|first1=David|last2=De Ville|first2=Barry|last3=Suen|first3=Ed|date=1991|title=A method of choosing multiway partitions for classification and decision trees|url=https://doi.org/10.1080/02664769100000005|journal=Journal of Applied Statistics|volume=18|issue=1|pages=49–62|doi=10.1080/02664769100000005|bibcode=1991JApSt..18...49B |issn=0266-4763}}</ref><ref>Ritschard, G. (2013), "CHAID and Earlier Supervised Tree Methods", in J.J. McArdle and G. Ritschard (eds), ''Contemporary Issues in Exploratory Data Mining in the Behavioral Sciences'', Quantitative Methodology Series, New York: Routledge, pages 48-74. [https://www.researchgate.net/publication/315476407_CHAID_and_Earlier_Supervised_Tree_Methods Preprint]</ref>
* [[Multivariate adaptive regression splines|MARS]]: extends decision trees to handle numerical data better.
* Conditional Inference Trees. Statistics-based approach that uses non-parametric tests as splitting criteria, corrected for multiple testing to avoid overfitting. This approach results in unbiased predictor selection and does not require pruning.<ref name="Hothorn2006">{{Cite journal | doi = 10.1198/106186006X133933 | last1 = Hothorn | first1 = T. | last2 = Hornik | first2 = K. | last3 = Zeileis | first3 = A. | year = 2006 | title = Unbiased Recursive Partitioning: A Conditional Inference Framework | jstor = 27594202 | journal = Journal of Computational and Graphical Statistics | volume = 15 | issue = 3 | pages = 651–674 | citeseerx = 10.1.1.527.2935 | s2cid = 6074128 }}</ref><ref name="Strobl2009">{{Cite journal | doi = 10.1037/a0016973 | pmid = 19968396 | pmc = 2927982 | last1 = Strobl | first1 = C. | last2 = Malley | first2 = J. | last3 = Tutz | first3 = G. | year = 2009 | title = An Introduction to Recursive Partitioning: Rationale, Application and Characteristics of Classification and Regression Trees, Bagging and Random Forests | journal = Psychological Methods | volume = 14 | issue = 4 | pages = 323–348 }}</ref>

ID3 and CART were invented independently at around the same time (between 1970 and 1980){{Citation needed|date=August 2014}}, yet follow a similar approach for learning a decision tree from training tuples.

It has also been proposed to leverage concepts of [[fuzzy set theory]] for the definition of a special version of decision tree, known as Fuzzy Decision Tree (FDT).<ref name="Janikow1998">{{Cite journal | doi=10.1109/3477.658573| title=Fuzzy decision trees: issues and methods| journal=IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics| volume=28| pages=1–14| year=1998| last1=Janikow| first1=C. Z.| issue=1| pmid=18255917}}</ref> In this type of fuzzy classification, generally, an input vector <math>\textbf{x}</math> is associated with multiple classes, each with a different confidence value. Boosted ensembles of FDTs have also been investigated recently, and they have shown performance comparable to that of other very efficient fuzzy classifiers.<ref name="Barsacchi2020">{{Cite journal | url=http://www.sciencedirect.com/science/article/pii/S0957417420302608 | doi=10.1016/j.eswa.2020.113436| title=An analysis of boosted ensembles of binary fuzzy decision trees| journal=Expert Systems with Applications| volume=154| year=2020| last1=Barsacchi| first1=M.|last2=Bechini| first2=A.|last3=Marcelloni| first3=F.| page=113436| s2cid=216369273}}</ref>
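How a fuzzy decision tree can associate one input with several classes can be sketched with a toy example. The membership function, feature, and leaf confidences below are invented for illustration and do not reproduce the specific methods of the cited papers; the point is only that an input descends both branches of a split with some degree of membership, and the class confidences of the leaves are aggregated accordingly:

<syntaxhighlight lang="python">
import math

# Toy sketch of the fuzzy-decision-tree idea (illustrative only): a split
# assigns a degree of membership to both branches, and leaf class
# confidences are combined weighted by those memberships.

def sigmoid_membership(value, threshold, steepness=1.0):
    """Degree (0..1) to which `value` satisfies 'value > threshold'."""
    return 1.0 / (1.0 + math.exp(-steepness * (value - threshold)))

def fuzzy_tree_confidences(x):
    # One fuzzy split on a hypothetical feature "temperature".
    mu_high = sigmoid_membership(x["temperature"], threshold=25.0)
    mu_low = 1.0 - mu_high
    # Each leaf holds a confidence value per class (made-up numbers).
    leaf_high = {"hot": 0.9, "cold": 0.1}
    leaf_low = {"hot": 0.2, "cold": 0.8}
    # Aggregate: weight each leaf's confidences by its branch membership.
    return {c: mu_high * leaf_high[c] + mu_low * leaf_low[c]
            for c in ("hot", "cold")}

print(fuzzy_tree_confidences({"temperature": 26.0}))
# e.g. {'hot': 0.71..., 'cold': 0.28...}: one input, multiple classes,
# each with its own confidence value.
</syntaxhighlight>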