===Measure of "goodness"===

Used by CART in 1984,<ref name="ll">{{Cite book |last=Larose |first=Daniel T. |author2=Larose, Chantal D. |title=Discovering knowledge in data: an introduction to data mining |year=2014 |publisher=John Wiley & Sons, Inc |location=Hoboken, NJ |isbn=9781118874059 }}</ref> the measure of "goodness" is a function that seeks to balance a candidate split's capacity to create pure children with its capacity to create equally sized children. This process is repeated for each impure node until the tree is complete. The function <math>\varphi(s\mid t)</math>, where <math>s</math> is a candidate split at node <math>t</math>, is defined as

:<math> \varphi(s\mid t) = 2P_L P_R \sum_{j=1}^\text{class count}|P(j\mid t_L) - P(j\mid t_R)| </math>

where <math>t_L</math> and <math>t_R</math> are the left and right children of node <math>t</math> under split <math>s</math>, respectively; <math>P_L</math> and <math>P_R</math> are the proportions of records in <math>t</math> that go to <math>t_L</math> and <math>t_R</math>, respectively; and <math>P(j\mid t_L)</math> and <math>P(j\mid t_R)</math> are the proportions of class <math>j</math> records in <math>t_L</math> and <math>t_R</math>, respectively.

Consider an example data set with three attributes, ''savings'' (low, medium, high), ''assets'' (low, medium, high), and ''income'' (numerical value), together with a binary target variable ''credit risk'' (good, bad) and 8 data points.<ref name="ll"/> The full data is presented in the table below. To start a decision tree, we calculate the maximum value of <math>\varphi(s\mid t)</math> for each feature to find which one should split the root node. This process continues until all children are pure or all <math>\varphi(s\mid t)</math> values fall below a set threshold.

{| class="wikitable"
|-
! Customer !! Savings !! Assets !! Income ($1000s) !! Credit risk
|-
| 1 || Medium || High || 75 || Good
|-
| 2 || Low || Low || 50 || Bad
|-
| 3 || High || Medium || 25 || Bad
|-
| 4 || Medium || Medium || 50 || Good
|-
| 5 || Low || Medium || 100 || Good
|-
| 6 || High || High || 25 || Good
|-
| 7 || Low || Low || 25 || Bad
|-
| 8 || Medium || Medium || 75 || Good
|}

To find <math>\varphi(s\mid t)</math> for the feature ''savings'', we need to note how many records take each value. The data contain three records with low ''savings'', three with medium, and two with high. Of the low records, one had a good ''credit risk'', while four of the five medium and high records had a good ''credit risk''. Assume a candidate split <math>s</math> such that records with low ''savings'' are put in the left child and all other records are put in the right child. Then

:<math> \varphi(s\mid\text{root}) = 2\cdot\frac 3 8\cdot\frac 5 8\cdot \left(\left|\frac 1 3 - \frac 4 5\right| + \left|\frac 2 3 - \frac 1 5\right|\right) = 0.44 </math>

(this calculation is reproduced in the code sketch at the end of this section).

To build the tree, the "goodness" of every candidate split for the root node needs to be calculated. The candidate with the maximum value splits the root node, and the process continues for each impure node until the tree is complete.

Compared to other metrics such as information gain, the measure of "goodness" will attempt to create a more balanced tree, leading to more consistent decision times. However, it sacrifices some priority for creating pure children, which can lead to additional splits that are not present with other metrics.
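The following is a minimal Python sketch, not taken from the cited source, showing how <math>\varphi(s\mid t)</math> could be evaluated for the candidate ''savings'' split worked through above. The function name <code>goodness_of_split</code> and the encoding of the records are illustrative assumptions, not part of CART itself.

<syntaxhighlight lang="python">
from collections import Counter

def goodness_of_split(left_labels, right_labels):
    """phi(s|t) = 2 * P_L * P_R * sum_j |P(j|t_L) - P(j|t_R)|"""
    n_left, n_right = len(left_labels), len(right_labels)
    n_total = n_left + n_right
    p_left, p_right = n_left / n_total, n_right / n_total
    left_counts, right_counts = Counter(left_labels), Counter(right_labels)
    classes = set(left_counts) | set(right_counts)
    class_diff = sum(abs(left_counts[c] / n_left - right_counts[c] / n_right)
                     for c in classes)
    return 2 * p_left * p_right * class_diff

# The eight records from the table, reduced to (savings, credit risk).
records = [("Medium", "Good"), ("Low", "Bad"), ("High", "Bad"), ("Medium", "Good"),
           ("Low", "Good"), ("High", "Good"), ("Low", "Bad"), ("Medium", "Good")]

# Candidate split s: savings == "Low" goes to the left child, everything else right.
left = [risk for savings, risk in records if savings == "Low"]
right = [risk for savings, risk in records if savings != "Low"]

print(round(goodness_of_split(left, right), 4))  # 0.4375, i.e. about 0.44
</syntaxhighlight>

In a full tree-building procedure, this value would be computed for every candidate split at each impure node, and the split with the largest <math>\varphi</math> would be chosen, as described above.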