==Decision trees==

[[Decision trees]] (DT) are classification models where a series of questions and answers are mapped using nodes and directed edges. Decision trees have three types of nodes: a root node, internal nodes, and leaf or terminal nodes. The root node and all internal nodes represent test conditions for different attributes or variables in a dataset. Leaf nodes specify the class label for all different paths in the tree. Most decision tree induction algorithms involve selecting an attribute for the root node and then making the same kind of informed decision about all the other nodes in the tree.

Decision trees can also be created by gene expression programming,<ref>{{cite book |last=Ferreira |first=C. |title=Gene Expression Programming: Mathematical Modeling by an Artificial Intelligence |publisher=Springer-Verlag |year=2006 |isbn=3-540-32796-7}}</ref> with the advantage that all the decisions concerning the growth of the tree are made by the algorithm itself without any kind of human input.

There are basically two different types of DT algorithms: one for inducing decision trees with only nominal attributes and another for inducing decision trees with both numeric and nominal attributes. This aspect of decision tree induction also carries over to gene expression programming, and there are two GEP algorithms for decision tree induction: the evolvable decision trees (EDT) algorithm for dealing exclusively with nominal attributes and the EDT-RNC (EDT with random numerical constants) algorithm for handling both nominal and numeric attributes.

In the decision trees induced by gene expression programming, the attributes behave as function nodes in the [[gene expression programming#The basic gene expression algorithm|basic gene expression algorithm]], whereas the class labels behave as terminals. This means that attribute nodes also have associated with them a specific arity or number of branches that will determine their growth and, ultimately, the growth of the tree. Class labels behave like terminals, which means that for a ''k''-class classification task, a terminal set with ''k'' terminals is used, representing the ''k'' different classes.

The rules for encoding a decision tree in a linear genome are very similar to the rules used to encode mathematical expressions (see [[gene expression programming#K-expressions and genes|above]]). So, for decision tree induction the genes also have a head and a tail, with the head containing attributes and terminals and the tail containing only terminals. This again ensures that all decision trees designed by GEP are always valid programs. Furthermore, the size of the tail ''t'' is also dictated by the head size ''h'' and the number of branches ''n''<sub>max</sub> of the attribute with the most branches, and is evaluated by the equation:

:<math>t = h(n_\max - 1) + 1</math>

For example, consider the decision tree below to decide whether to play outside:

{| align="center" border="0" cellpadding="4" cellspacing="0"
| [[File:Decision tree for playing outside.png]]
|}

It can be linearly encoded as:

:<code><nowiki>01234567</nowiki></code>
:
:<code><nowiki>HOWbaaba</nowiki></code>

where “H” represents the attribute Humidity, “O” the attribute Outlook, “W” the attribute Windy, and “a” and “b” the class labels "Yes" and "No", respectively. Note that the edges connecting the nodes are properties of the data, specifying the type and number of branches of each attribute, and therefore don't have to be encoded.
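The expression of such a linear genome can be sketched in a few lines of code. The snippet below is only an illustration of the level-order filling scheme described in this section, not an implementation from the cited reference; the arity table (two branches for Humidity and Windy, three for Outlook) and the node representation are assumptions made for this example.

<syntaxhighlight lang="python">
from collections import deque

# Arity (number of branches) of each symbol. These values are assumptions
# taken from the example tree above: Humidity and Windy split two ways,
# Outlook splits three ways; "a" ("Yes") and "b" ("No") are terminals.
ARITY = {'H': 2, 'O': 3, 'W': 2, 'a': 0, 'b': 0}

def tail_size(head_size, n_max):
    """Tail length from the equation above: t = h*(n_max - 1) + 1."""
    return head_size * (n_max - 1) + 1

def express(k_expression):
    """Turn a K-expression such as "HOWbaaba" into a decision tree by
    filling the open branches level by level, left to right (breadth-first)."""
    symbols = iter(k_expression)
    root = {'label': next(symbols), 'children': []}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for _ in range(ARITY[node['label']]):
            child = {'label': next(symbols), 'children': []}
            node['children'].append(child)
            queue.append(child)
    return root

def show(node, depth=0):
    print('  ' * depth + node['label'])
    for child in node['children']:
        show(child, depth + 1)

show(express('HOWbaaba'))
# H          (Humidity)
#   O        (Outlook)
#     b
#     a
#     a
#   W        (Windy)
#     b
#     a
print(tail_size(3, 3))   # a head of 3 symbols with a 3-branch attribute needs a tail of 7
</syntaxhighlight>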
The process of decision tree induction with gene expression programming starts, as usual, with an initial population of randomly created chromosomes. Then the chromosomes are expressed as decision trees and their fitness evaluated against a training dataset. According to fitness they are then selected to reproduce with modification. The genetic operators are exactly the same as those used in a conventional unigenic system, for example, [[gene expression programming#Mutation|mutation]], [[gene expression programming#Inversion|inversion]], [[gene expression programming#Transposition|transposition]], and [[gene expression programming#Recombination|recombination]].

Decision trees with both nominal and numeric attributes are also easily induced with gene expression programming using the framework described [[gene expression programming#The GEP-RNC algorithm|above]] for dealing with random numerical constants. The chromosomal architecture includes an extra domain for encoding random numerical constants, which are used as thresholds for splitting the data at each branching node. For example, the gene below with a head size of 5 (the Dc starts at position 16):

:<code><nowiki>012345678901234567890</nowiki></code>
:
:<code><nowiki>WOTHabababbbabba46336</nowiki></code>

encodes the decision tree shown below:

{| align="center" border="0" cellpadding="4" cellspacing="0"
| [[File:GEP decision tree, k-expression WOTHababab.png]]
|}

In this system, every node in the head, irrespective of its type (numeric attribute, nominal attribute, or terminal), has associated with it a random numerical constant, which for simplicity in the example above is represented by a numeral 0–9. These random numerical constants are encoded in the Dc domain and their expression follows a very simple scheme: from top to bottom and from left to right, the elements in Dc are assigned one by one to the elements in the decision tree. So, for the following array of RNCs:

: ''C'' = {62, 51, 68, 83, 86, 41, 43, 44, 9, 67}

the decision tree above results in:

{| align="center" border="0" cellpadding="4" cellspacing="0"
| [[File:GEP decision tree with numeric and nominal attributes, k-expression WOTHababab.png]]
|}

which can also be represented more colorfully as a conventional decision tree:

{| align="center" border="0" cellpadding="4" cellspacing="0"
| [[File:GEP decision tree with numeric and nominal attributes.png]]
|}
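The assignment of the random numerical constants can be sketched in the same way. The snippet below is only an illustration under stated assumptions: the arity table, the choice of Temperature and Humidity as the numeric attributes, and the convention that constants falling on nominal attributes or terminals are simply ignored are all choices made for this example rather than details taken from the cited reference.

<syntaxhighlight lang="python">
from collections import deque

# Assumed arities for this example: W (Windy), T (Temperature) and
# H (Humidity) split two ways, O (Outlook) splits three ways; "a" and "b"
# are class-label terminals. Treating T and H as the numeric attributes
# is an assumption made for this illustration.
ARITY = {'W': 2, 'O': 3, 'T': 2, 'H': 2, 'a': 0, 'b': 0}
NUMERIC = {'T', 'H'}

C = [62, 51, 68, 83, 86, 41, 43, 44, 9, 67]   # the array of RNCs above

HEAD = 5
TAIL = HEAD * (3 - 1) + 1        # t = h(n_max - 1) + 1 = 11
gene = 'WOTHabababbbabba46336'
dc = gene[HEAD + TAIL:]          # Dc domain starts at position 16: "46336"

def express(k_expression):
    """Breadth-first expression of the head/tail part into a tree."""
    symbols = iter(k_expression)
    root = {'label': next(symbols), 'children': [], 'threshold': None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for _ in range(ARITY[node['label']]):
            child = {'label': next(symbols), 'children': [], 'threshold': None}
            node['children'].append(child)
            queue.append(child)
    return root

tree = express(gene[:HEAD + TAIL])

# Map the Dc symbols through C and hand them to the tree nodes from top to
# bottom and from left to right; in this sketch only numeric attributes keep
# their constant, using it as a split threshold.
queue, digits = deque([tree]), iter(dc)
while queue:
    node = queue.popleft()
    digit = next(digits, None)
    if digit is not None and node['label'] in NUMERIC:
        node['threshold'] = C[int(digit)]
    queue.extend(node['children'])
</syntaxhighlight>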