=====Fitness functions for classification and logistic regression=====
The design of fitness functions for [[Statistical classification|classification]] and [[logistic regression]] takes advantage of three different characteristics of classification models. The most obvious is simply counting the hits: if a record is classified correctly, it is counted as a hit. This fitness function is very simple and works well for simple problems, but for more complex problems or highly unbalanced datasets it gives poor results.

One way to improve this type of hits-based fitness function consists of expanding the notion of correct and incorrect classifications. In a binary classification task, correct classifications can be 00 or 11. The "00" representation means that a negative case (represented by "0") was correctly classified, whereas "11" means that a positive case (represented by "1") was correctly classified. Classifications of the type "00" are called true negatives (TN) and "11" true positives (TP). There are also two types of incorrect classifications, represented by 01 and 10. They are called false positives (FP) when the actual value is 0 and the model predicts a 1, and false negatives (FN) when the target is 1 and the model predicts a 0. The counts of TP, TN, FP, and FN are usually kept in a table known as the [[confusion matrix]].

{| class="infobox"
|+ [[Confusion matrix]] for a binomial classification task.
! colspan="2" rowspan="2" | !! colspan="2" | Predicted class
|-
! {{Yes}} || {{No}}
|-
! rowspan="2" {{vertical header|Actual<br/>class|va=middle}} || {{Yes}}
| style="text-align:center;" | TP || style="text-align:center;" | FN
|-
! {{No}}
| style="text-align:center;" | FP || style="text-align:center;" | TN
|}

So by counting the TP, TN, FP, and FN, and further assigning different weights to these four types of classifications, it is possible to create smoother and therefore more efficient fitness functions. Some popular fitness functions based on the confusion matrix include [[Sensitivity and specificity|sensitivity/specificity]], [[Precision and recall|recall/precision]], [[F-measure]], [[Jaccard similarity]], the [[Matthews correlation coefficient]], and the cost/gain matrix, which combines the costs and gains assigned to the four different types of classifications.

These functions based on the confusion matrix are quite sophisticated and adequate to solve most problems efficiently. But there is another dimension to classification models which is key to exploring the solution space more efficiently and therefore results in the discovery of better classifiers. This new dimension involves exploring the structure of the model itself, which includes not only the domain and range, but also the distribution of the model output and the classifier margin. By exploring this other dimension of classification models and then combining the information about the model with the confusion matrix, it is possible to design very sophisticated fitness functions that allow the smooth exploration of the solution space. For instance, one can combine some measure based on the confusion matrix with the [[mean squared error]] evaluated between the raw model outputs and the actual values; or combine the [[F-measure]] with the [[R-square]] evaluated for the raw model output and the target; or the cost/gain matrix with the [[Pearson product-moment correlation coefficient|correlation coefficient]]; and so on.
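The following Python sketch illustrates the two ideas above: a fitness function built from weighted confusion-matrix counts, and one that blends a hits-based score with the mean squared error of the raw model outputs. The weight values, the threshold, and the blending factor are illustrative assumptions, not values prescribed by gene expression programming.

<syntaxhighlight lang="python">
def confusion_counts(targets, predictions):
    """Count TP, TN, FP, FN for 0/1 targets and 0/1 predictions."""
    tp = sum(1 for t, p in zip(targets, predictions) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(targets, predictions) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(targets, predictions) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(targets, predictions) if t == 1 and p == 0)
    return tp, tn, fp, fn

def weighted_fitness(targets, predictions,
                     w_tp=1.0, w_tn=1.0, w_fp=-1.0, w_fn=-1.0):
    """Fitness as a weighted sum of the four confusion-matrix counts.

    The weights are illustrative; rewarding hits and penalizing
    misclassifications differently yields a smoother fitness landscape
    than plain hit counting, e.g. on unbalanced datasets.
    """
    tp, tn, fp, fn = confusion_counts(targets, predictions)
    return w_tp * tp + w_tn * tn + w_fp * fp + w_fn * fn

def combined_fitness(targets, raw_outputs, threshold=0.5, mse_weight=0.5):
    """Blend accuracy with the MSE of the raw model outputs, so that
    two classifiers with the same hit count can still be ranked by
    how close their raw outputs are to the targets."""
    predictions = [1 if o >= threshold else 0 for o in raw_outputs]
    tp, tn, fp, fn = confusion_counts(targets, predictions)
    hits = (tp + tn) / len(targets)  # accuracy in [0, 1]
    mse = sum((t - o) ** 2 for t, o in zip(targets, raw_outputs)) / len(targets)
    return hits - mse_weight * mse   # higher is fitter
</syntaxhighlight>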
More exotic fitness functions that explore model granularity include the area under the [[Receiver operating characteristic|ROC curve]] and the rank measure. Also related to this new dimension of classification models is the idea of assigning probabilities to the model output, which is what is done in [[logistic regression]]. Then it is also possible to use these probabilities and evaluate the [[mean squared error]] (or some other similar measure) between the probabilities and the actual values, and then combine this with the confusion matrix to create very efficient fitness functions for logistic regression. Popular examples of fitness functions based on the probabilities include [[maximum likelihood estimation]] and [[hinge loss]].
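As a minimal sketch of a probability-based fitness function of the maximum-likelihood kind mentioned above, the raw model output can be squashed into a probability with the logistic function and scored by its log-likelihood. The use of the logistic squashing and the clipping constant here are implementation assumptions, not part of any specific gene expression programming algorithm.

<syntaxhighlight lang="python">
import math

def log_likelihood_fitness(targets, raw_outputs):
    """Maximum-likelihood style fitness for logistic regression:
    map each raw output to a probability of class 1 via the logistic
    function, then sum the log-likelihood; higher is fitter."""
    eps = 1e-12  # clip probabilities to avoid log(0)
    ll = 0.0
    for t, o in zip(targets, raw_outputs):
        # numerically stable logistic function
        if o >= 0:
            p = 1.0 / (1.0 + math.exp(-o))
        else:
            p = math.exp(o) / (1.0 + math.exp(o))
        p = min(max(p, eps), 1.0 - eps)
        ll += t * math.log(p) + (1 - t) * math.log(1 - p)
    return ll
</syntaxhighlight>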