====Fitness functions====

Broadly speaking, there are three different kinds of problems based on the kind of prediction being made:

# Problems involving numeric (continuous) predictions;
# Problems involving categorical or nominal predictions, both binomial and multinomial;
# Problems involving binary or Boolean predictions.

The first type of problem goes by the name of [[Regression analysis|regression]]; the second is known as [[Statistical classification|classification]], with [[logistic regression]] as a special case where, besides the crisp classifications like "Yes" or "No", a probability is also attached to each outcome; and the last one is related to [[Boolean algebra (logic)|Boolean algebra]] and [[logic synthesis]].

=====Fitness functions for regression=====

In [[Regression analysis|regression]], the response or dependent variable is numeric (usually continuous), and therefore the output of a regression model is also continuous. It is thus straightforward to evaluate the fitness of the evolving models by comparing the output of each model to the value of the response in the training data.

There are several basic [[fitness function]]s for evaluating model performance, with the most common being based on the error or residual between the model output and the actual value. Such functions include the [[mean squared error]], [[root mean squared error]], [[mean absolute error]], relative squared error, root relative squared error, relative absolute error, and others. All these standard measures offer a fine granularity or smoothness to the solution space and therefore work very well for most applications.

Some problems, however, might require a coarser evolution, such as determining whether a prediction falls within a certain interval of the actual value, for instance within 10% of it. However, even if one is only interested in counting the hits (that is, predictions that fall within the chosen interval), making populations of models evolve based only on the number of hits each program scores is usually not very efficient, due to the coarse granularity of the resulting [[fitness landscape]]. Thus the solution usually involves combining these coarse measures with some kind of smooth function, such as the standard error measures listed above, as illustrated in the sketch below.

Fitness functions based on the [[Pearson product-moment correlation coefficient|correlation coefficient]] and [[R-square]] are also very smooth. For regression problems, these functions work best when combined with other measures because, by themselves, they only measure [[Correlation and dependence|correlation]], without regard to the range of values of the model output. Combined with functions that approximate the range of the target values, they form very efficient fitness functions for finding models with both good correlation and a good fit between predicted and actual values.
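The following sketch illustrates one such combination of a coarse hit count with a smooth error measure. The function name, the 10% hit interval, and the way the two terms are weighted are illustrative choices, not prescribed by gene expression programming:

<syntaxhighlight lang="python">
import math

def regression_fitness(predictions, targets, hit_tolerance=0.10):
    """Combine a coarse hit count with a smooth error measure (RMSE).

    A prediction scores a hit when it falls within hit_tolerance
    (here 10%) of the actual value; the RMSE term keeps the fitness
    landscape smooth between consecutive hit counts.
    """
    n = len(targets)
    # Count hits; targets equal to zero are skipped to avoid division by zero.
    hits = sum(1 for p, t in zip(predictions, targets)
               if t != 0 and abs(p - t) / abs(t) <= hit_tolerance)
    rmse = math.sqrt(sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n)
    # Higher fitness is better: reward hits, penalise residual error.
    return hits + 1.0 / (1.0 + rmse)
</syntaxhighlight>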
=====Fitness functions for classification and logistic regression=====

The design of fitness functions for [[Statistical classification|classification]] and [[logistic regression]] takes advantage of three different characteristics of classification models. The most obvious is just counting the hits: if a record is classified correctly, it is counted as a hit. This fitness function is very simple and works well for simple problems, but for more complex problems or highly unbalanced datasets it gives poor results.

One way to improve this type of hits-based fitness function consists of expanding the notion of correct and incorrect classifications. In a binary classification task, correct classifications can be 00 or 11. The "00" representation means that a negative case (represented by "0") was correctly classified, whereas the "11" means that a positive case (represented by "1") was correctly classified. Classifications of the type "00" are called true negatives (TN) and "11" true positives (TP). There are also two types of incorrect classifications, represented by 01 and 10. They are called false positives (FP) when the actual value is 0 and the model predicts a 1, and false negatives (FN) when the target is 1 and the model predicts a 0. The counts of TP, TN, FP, and FN are usually kept in a table known as the [[confusion matrix]].

{| class="infobox"
|+ [[Confusion matrix]] for a binomial classification task.
! colspan="2" rowspan="2" |
! colspan="2" | Predicted class
|-
! {{Yes}} !! {{No}}
|-
! rowspan="2" {{vertical header|Actual<br/>class|va=middle}}
! {{Yes}}
| style="text-align:center;" | TP || style="text-align:center;" | FN
|-
! {{No}}
| style="text-align:center;" | FP || style="text-align:center;" | TN
|}

By counting the TP, TN, FP, and FN and further assigning different weights to these four types of classifications, it is possible to create smoother and therefore more efficient fitness functions. Some popular fitness functions based on the confusion matrix include [[Sensitivity and specificity|sensitivity/specificity]], [[Precision and recall|recall/precision]], the [[F-measure]], the [[Jaccard similarity]], the [[Matthews correlation coefficient]], and the cost/gain matrix, which combines the costs and gains assigned to the four types of classifications.

These functions based on the confusion matrix are quite sophisticated and adequate for solving most problems efficiently. But there is another dimension to classification models which is key to exploring the solution space more efficiently and therefore to discovering better classifiers. This new dimension involves exploring the structure of the model itself, which includes not only the domain and range but also the distribution of the model output and the classifier margin.

By exploring this other dimension of classification models and then combining the information about the model with the confusion matrix, it is possible to design very sophisticated fitness functions that allow a smooth exploration of the solution space. For instance, one can combine some measure based on the confusion matrix with the [[mean squared error]] evaluated between the raw model outputs and the actual values; or combine the [[F-measure]] with the [[R-square]] evaluated for the raw model output and the target; or the cost/gain matrix with the [[Pearson product-moment correlation coefficient|correlation coefficient]]; and so on. More exotic fitness functions that explore model granularity include the area under the [[Receiver operating characteristic|ROC curve]] and the rank measure.

Also related to this new dimension of classification models is the idea of assigning probabilities to the model output, which is what is done in [[logistic regression]]. It is then possible to use these probabilities to evaluate the [[mean squared error]] (or some other similar measure) between the probabilities and the actual values, and then combine this with the confusion matrix to create very efficient fitness functions for logistic regression. Popular examples of fitness functions based on probabilities include [[maximum likelihood estimation]] and [[hinge loss]].
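The following sketch illustrates one of the combinations described above, pairing a confusion-matrix measure (here the product of sensitivity and specificity) with the mean squared error of the raw model outputs. The 0.5 rounding threshold and the particular way the two terms are combined are illustrative choices:

<syntaxhighlight lang="python">
def classification_fitness(raw_outputs, targets, threshold=0.5):
    """Fitness combining a confusion-matrix measure with a smooth error term.

    raw_outputs are the continuous outputs of the evolved model and
    targets the actual 0/1 class labels; the choice of
    sensitivity * specificity as the coarse term is illustrative.
    """
    tp = tn = fp = fn = 0
    for raw, target in zip(raw_outputs, targets):
        predicted = 1 if raw >= threshold else 0
        if predicted == 1:
            if target == 1:
                tp += 1
            else:
                fp += 1
        else:
            if target == 0:
                tn += 1
            else:
                fn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Smooth term: mean squared error between raw outputs and 0/1 targets.
    mse = sum((raw - t) ** 2 for raw, t in zip(raw_outputs, targets)) / len(targets)
    # The confusion-matrix product drives correct classification, while the
    # MSE term smooths the fitness landscape between hit counts.
    return sensitivity * specificity / (1.0 + mse)
</syntaxhighlight>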
=====Fitness functions for Boolean problems=====

In logic, there is no model structure (as defined [[#Fitness functions for classification and logistic regression|above]] for classification and logistic regression) to explore: the domain and range of logical functions comprise only 0s and 1s, or false and true. So the fitness functions available for [[Boolean algebra (logic)|Boolean algebra]] can only be based on the hits or on the confusion matrix, as explained in the section [[#Fitness functions for classification and logistic regression|above]].
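The following sketch illustrates a hits-based fitness evaluation for a Boolean problem, scoring a candidate expression over the full truth table. The majority function used as the target and the particular candidate expression are illustrative examples only:

<syntaxhighlight lang="python">
from itertools import product

def boolean_hits_fitness(candidate, target, n_inputs):
    """Count the hits of an evolved Boolean function over the full truth table.

    candidate and target are callables taking n_inputs Boolean arguments.
    With no raw output to smooth, the hit count over all 2**n_inputs input
    rows is the natural fitness measure for Boolean problems.
    """
    return sum(candidate(*row) == target(*row)
               for row in product((False, True), repeat=n_inputs))

# Example: an evolved expression scored against a 3-input majority function.
target = lambda a, b, c: (a and b) or (a and c) or (b and c)
candidate = lambda a, b, c: (a and b) or c        # an imperfect candidate
print(boolean_hits_fitness(candidate, target, 3))  # prints 7 (out of 8 rows)
</syntaxhighlight>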