==The basic gene expression algorithm==
The fundamental steps of the basic gene expression algorithm are listed below in pseudocode:
# Select function set;
# Select terminal set;
# Load dataset for fitness evaluation;
# Create chromosomes of initial population randomly;
# For each program in population:{{ordered list|list-style-type=lower-alpha | Express chromosome; | Execute program; | Evaluate fitness;}}
# Verify stop condition;
# Select programs;
# Replicate selected programs to form the next population;
# Modify chromosomes using genetic operators;
# Go to step 5.

The first four steps prepare all the ingredients needed for the iterative loop of the algorithm (steps 5 through 10). Of these preparative steps, the crucial one is the creation of the initial population, which is generated randomly from the elements of the function and terminal sets.
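The loop above can be sketched in a few lines of code. The following Python sketch is illustrative only: the helpers <code>express()</code>, <code>fitness()</code>, <code>select()</code>, and <code>modify()</code> are hypothetical placeholders for the decoding, evaluation, selection, and modification machinery described in the remainder of this section.

<syntaxhighlight lang="python">
import random

# A minimal sketch of the basic gene expression algorithm above. The
# callables express(), fitness(), select(), and modify() are hypothetical
# stand-ins for the machinery described later in this section.

FUNCTIONS = ["+", "-", "*"]                 # step 1: select function set
TERMINALS = ["a", "b"]                      # step 2: select terminal set
DATASET = [({"a": 1.0, "b": 2.0}, 3.0)]     # step 3: load dataset (toy example)
HEAD = 7                                    # head length of each gene
TAIL = HEAD * (2 - 1) + 1                   # tail length t = h(n-1)+1 for max arity n = 2

def random_chromosome():
    # Step 4: heads may hold functions or terminals; tails hold only
    # terminals, so every chromosome encodes a valid program.
    return ([random.choice(FUNCTIONS + TERMINALS) for _ in range(HEAD)]
            + [random.choice(TERMINALS) for _ in range(TAIL)])

def run(express, fitness, select, modify, pop_size=50, max_generations=100):
    population = [random_chromosome() for _ in range(pop_size)]
    for _ in range(max_generations):                   # steps 5-10
        scores = [fitness(express(c), DATASET) for c in population]
        if max(scores) >= 1.0:                         # step 6: stop condition
            break
        parents = select(population, scores)           # steps 7-8
        population = [modify(p) for p in parents]      # step 9
    return population
</syntaxhighlight>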
===Populations of programs===
Like all evolutionary algorithms, gene expression programming works with populations of individuals, which in this case are computer programs. Therefore, some kind of initial population must be created to get things started. Subsequent populations are descendants, via [[#Selection and elitism|selection]] and [[#Reproduction with modification|genetic modification]], of the initial population.

In the genotype/phenotype system of gene expression programming, it is only necessary to create the simple linear chromosomes of the individuals, without worrying about the structural soundness of the programs they code for, as their expression always results in syntactically correct programs.

===Fitness functions and the selection environment===
Fitness functions and selection environments (called training datasets in [[machine learning]]) are the two facets of fitness and are therefore intricately connected. Indeed, the fitness of a program depends not only on the [[Loss function|cost function]] used to measure its performance but also on the training data chosen to evaluate fitness.

====The selection environment or training data====
The selection environment consists of the set of training records, also called fitness cases. These fitness cases could be a set of observations or measurements concerning some problem, and they form what is called the training dataset.

The quality of the training data is essential for the evolution of good solutions. A good training set should be representative of the problem at hand and also well-balanced, otherwise the algorithm might get stuck at some local optimum. In addition, it is important to avoid unnecessarily large training datasets, as these slow evolution down. A good rule of thumb is to choose just enough records for training to enable good generalization on the validation data, leaving the remaining records for validation and testing.

====Fitness functions====
Broadly speaking, there are essentially three different kinds of problems based on the kind of prediction being made:
# Problems involving numeric (continuous) predictions;
# Problems involving categorical or nominal predictions, both binomial and multinomial;
# Problems involving binary or Boolean predictions.

The first type of problem goes by the name of [[Regression analysis|regression]]; the second is known as [[Statistical classification|classification]], with [[logistic regression]] as a special case where, besides the crisp classifications like "Yes" or "No", a probability is also attached to each outcome; and the last one is related to [[Boolean algebra (logic)|Boolean algebra]] and [[logic synthesis]].

=====Fitness functions for regression=====
In [[Regression analysis|regression]], the response or dependent variable is numeric (usually continuous), and therefore the output of a regression model is also continuous. So it is quite straightforward to evaluate the fitness of the evolving models by comparing the output of the model to the value of the response in the training data.

There are several basic [[fitness function]]s for evaluating model performance, the most common being based on the error or residual between the model output and the actual value. Such functions include the [[mean squared error]], [[root mean squared error]], [[mean absolute error]], relative squared error, root relative squared error, relative absolute error, and others. All these standard measures offer a fine granularity or smoothness to the solution space and therefore work very well for most applications. But some problems might require a coarser evolution, such as determining whether a prediction lies within a certain interval, for instance within 10% of the actual value. However, even if one is only interested in counting the hits (that is, predictions within the chosen interval), making populations of models evolve based just on the number of hits each program scores is usually not very efficient due to the coarse granularity of the [[fitness landscape]]. Thus the solution usually involves combining these coarse measures with some kind of smooth function, such as the standard error measures listed above.

Fitness functions based on the [[Pearson product-moment correlation coefficient|correlation coefficient]] and [[R-square]] are also very smooth. For regression problems, these functions work best when combined with other measures because, by themselves, they tend to measure only [[Correlation and dependence|correlation]], without caring for the range of values of the model output. So, by combining them with functions that work at approximating the range of the target values, they form very efficient fitness functions for finding models with good correlation and good fit between predicted and actual values.
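The following sketch illustrates the two kinds of regression measures just described: a smooth error-based fitness and a coarse hits-based fitness. The <code>1/(1 + RMSE)</code> mapping is one illustrative way of turning an error into a bounded, maximizable fitness; it is not the only choice.

<syntaxhighlight lang="python">
from math import sqrt

# Smooth fitness derived from the root mean squared error: a perfect
# model (RMSE = 0) receives fitness 1.0, and fitness decays smoothly
# as the error grows.
def rmse_fitness(predicted, actual):
    rmse = sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))
    return 1.0 / (1.0 + rmse)

# Coarse hits-based fitness: counts predictions within 10% of the actual
# value. On its own this gives a coarse fitness landscape, so it is
# usually combined with a smooth measure such as the one above.
def hits_fitness(predicted, actual, tolerance=0.10):
    return sum(abs(p - a) <= tolerance * abs(a) for p, a in zip(predicted, actual))
</syntaxhighlight>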
=====Fitness functions for classification and logistic regression=====
The design of fitness functions for [[Statistical classification|classification]] and [[logistic regression]] takes advantage of three different characteristics of classification models. The most obvious is just counting the hits: if a record is classified correctly, it is counted as a hit. This fitness function is very simple and works well for simple problems, but for more complex problems or highly unbalanced datasets it gives poor results.

One way to improve this type of hits-based fitness function consists of expanding the notion of correct and incorrect classifications. In a binary classification task, correct classifications can be 00 or 11. The "00" representation means that a negative case (represented by "0") was correctly classified, whereas "11" means that a positive case (represented by "1") was correctly classified. Classifications of the type "00" are called true negatives (TN) and "11" true positives (TP). There are also two types of incorrect classifications, represented by 01 and 10: false positives (FP), when the actual value is 0 and the model predicts a 1; and false negatives (FN), when the target is 1 and the model predicts a 0. The counts of TP, TN, FP, and FN are usually kept in a table known as the [[confusion matrix]].

{| class="infobox"
|+ [[Confusion matrix]] for a binomial classification task.
! colspan="2" rowspan="2" | !! colspan="2" | Predicted class
|-
! {{Yes}} !! {{No}}
|-
! rowspan="2" | {{vertical header|Actual<br/>class|va=middle}}
! {{Yes}}
| style="text-align:center;" | TP
| style="text-align:center;" | FN
|-
! {{No}}
| style="text-align:center;" | FP
| style="text-align:center;" | TN
|}

So, by counting the TP, TN, FP, and FN and further assigning different weights to these four types of classifications, it is possible to create smoother and therefore more efficient fitness functions. Some popular fitness functions based on the confusion matrix include [[Sensitivity and specificity|sensitivity/specificity]], [[Precision and recall|recall/precision]], [[F-measure]], [[Jaccard similarity]], [[Matthews correlation coefficient]], and the cost/gain matrix, which combines the costs and gains assigned to the four different types of classifications.

These functions based on the confusion matrix are quite sophisticated and are adequate to solve most problems efficiently. But there is another dimension to classification models which is key to exploring the solution space more efficiently and therefore results in the discovery of better classifiers. This new dimension involves exploring the structure of the model itself, which includes not only the domain and range but also the distribution of the model output and the classifier margin.

By exploring this other dimension of classification models and then combining the information about the model with the confusion matrix, it is possible to design very sophisticated fitness functions that allow the smooth exploration of the solution space. For instance, one can combine some measure based on the confusion matrix with the [[mean squared error]] evaluated between the raw model outputs and the actual values; or combine the [[F-measure]] with the [[R-square]] evaluated for the raw model output and the target; or the cost/gain matrix with the [[Pearson product-moment correlation coefficient|correlation coefficient]]; and so on. More exotic fitness functions that explore model granularity include the area under the [[Receiver operating characteristic|ROC curve]] and the rank measure.

Also related to this new dimension of classification models is the idea of assigning probabilities to the model output, which is what is done in [[logistic regression]]. It is then possible to use these probabilities and evaluate the [[mean squared error]] (or some other similar measure) between the probabilities and the actual values, and then combine this with the confusion matrix to create very efficient fitness functions for logistic regression. Popular examples of fitness functions based on probabilities include [[maximum likelihood estimation]] and [[hinge loss]].
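As an illustration, the following sketch builds a fitness from the confusion matrix by combining sensitivity and specificity. Taking their product is one illustrative weighting among the many mentioned above; it penalizes models that score well on one class by ignoring the other, which plain hit counting on unbalanced data does not.

<syntaxhighlight lang="python">
# Tally the four cells of the confusion matrix for a binary task,
# with classes encoded as 0 (negative) and 1 (positive).
def confusion_matrix(predicted, actual):
    tp = sum(p == 1 and a == 1 for p, a in zip(predicted, actual))
    tn = sum(p == 0 and a == 0 for p, a in zip(predicted, actual))
    fp = sum(p == 1 and a == 0 for p, a in zip(predicted, actual))
    fn = sum(p == 0 and a == 1 for p, a in zip(predicted, actual))
    return tp, tn, fp, fn

def sensitivity_specificity_fitness(predicted, actual):
    tp, tn, fp, fn = confusion_matrix(predicted, actual)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0   # true positive rate
    specificity = tn / (tn + fp) if tn + fp else 0.0   # true negative rate
    return sensitivity * specificity                   # 1.0 only for a perfect classifier
</syntaxhighlight>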
=====Fitness functions for Boolean problems=====
In logic there is no model structure (as defined [[#Fitness functions for classification and logistic regression|above]] for classification and logistic regression) to explore: the domain and range of logical functions comprise only 0's and 1's, or false and true. So, the fitness functions available for [[Boolean algebra (logic)|Boolean algebra]] can only be based on the hits or on the confusion matrix, as explained in the section [[#Fitness functions for classification and logistic regression|above]].

===Selection and elitism===
[[Roulette-wheel selection]] is perhaps the most popular selection scheme used in evolutionary computation. It involves mapping the fitness of each program to a slice of the roulette wheel proportional to its fitness. Then the roulette is spun as many times as there are programs in the population in order to keep the population size constant. So, with roulette-wheel selection programs are selected both according to fitness and the luck of the draw, which means that sometimes the best traits might be lost. However, by combining roulette-wheel selection with the cloning of the best program of each generation, one guarantees that at least the very best traits are not lost. This technique of cloning the best-of-generation program is known as simple elitism and is used by most stochastic selection schemes.
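A minimal sketch of this scheme follows: the best program of the generation is cloned unchanged, and the remaining slots are filled by fitness-proportionate spins of the wheel. It assumes non-negative fitness values, as produced by the fitness functions above.

<syntaxhighlight lang="python">
import random

def roulette_with_elitism(population, fitnesses):
    # Simple elitism: clone the best-of-generation program unchanged.
    best = max(range(len(population)), key=lambda i: fitnesses[i])
    next_generation = [population[best]]
    # Fill the remaining slots with fitness-proportionate draws, keeping
    # the population size constant.
    total = sum(fitnesses)
    while len(next_generation) < len(population):
        spin = random.uniform(0.0, total)          # one spin of the wheel
        cumulative = 0.0
        for program, fitness in zip(population, fitnesses):
            cumulative += fitness                  # slice proportional to fitness
            if cumulative >= spin:
                next_generation.append(program)
                break
    return next_generation
</syntaxhighlight>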
===Reproduction with modification===
The reproduction of programs involves first the selection and then the reproduction of their genomes. Genome modification is not required for reproduction, but without it adaptation and evolution will not take place.

====Replication and selection====
The selection operator selects the programs for the replication operator to copy. Depending on the selection scheme, the number of copies one program originates may vary, with some programs getting copied more than once while others are copied just once or not at all. In addition, selection is usually set up so that the population size remains constant from one generation to another.

The replication of genomes in nature is very complex, and it took scientists a long time to discover the [[DNA double helix]] and propose a mechanism for its replication. But the replication of strings is trivial in artificial evolutionary systems, where only an instruction to copy strings is required to pass all the information in the genome from generation to generation.

The replication of the selected programs is a fundamental piece of all artificial evolutionary systems, but for evolution to occur it needs to be implemented not with the usual precision of a copy instruction, but rather with a few errors thrown in. Indeed, genetic diversity is created with [[genetic operator]]s such as [[Mutation (genetic algorithm)|mutation]], [[Crossover (genetic algorithm)|recombination]], [[Transposition (genetics)|transposition]], inversion, and many others.

====Mutation====
In gene expression programming mutation is the most important genetic operator.<ref>{{cite web|last=Ferreira|first=C.|year=2002|title=Mutation, Transposition, and Recombination: An Analysis of the Evolutionary Dynamics|url=http://www.gene-expression-programming.com/webpapers/ferreira-fea02.pdf|publisher=In H. J. Caulfield, S.-H. Chen, H.-D. Cheng, R. Duro, V. Honavar, E. E. Kerre, M. Lu, M. G. Romay, T. K. Shih, D. Ventura, P. P. Wang, Y. Yang, eds., Proceedings of the 6th Joint Conference on Information Sciences, 4th International Workshop on Frontiers in Evolutionary Algorithms, pages 614–617, Research Triangle Park, North Carolina, USA}}</ref> It changes genomes by replacing one element with another. The accumulation of many small changes over time can create great diversity.

In gene expression programming mutation is totally unconstrained, which means that in each gene domain any domain symbol can be replaced by another. For example, in the heads of genes any function can be replaced by a terminal or another function, regardless of the number of arguments of the new function; and a terminal can be replaced by a function or another terminal.
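The following sketch shows point mutation on a single-gene chromosome with the head/tail structure used throughout this article. Head positions may receive any function or terminal; tail positions may only receive terminals, so the mutant still encodes a syntactically correct program. The symbol sets and mutation rate are illustrative.

<syntaxhighlight lang="python">
import random

FUNCTIONS = ["+", "-", "*", "/"]   # illustrative function set
TERMINALS = ["a", "b"]             # illustrative terminal set
HEAD = 7                           # illustrative head length

def mutate(chromosome, rate=0.05):
    mutant = list(chromosome)
    for i in range(len(mutant)):
        if random.random() < rate:
            # Heads accept any symbol; tails accept only terminals, so
            # the mutated chromosome always remains structurally valid.
            pool = FUNCTIONS + TERMINALS if i < HEAD else TERMINALS
            mutant[i] = random.choice(pool)
    return mutant
</syntaxhighlight>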
====Recombination====
[[Crossover (genetic algorithm)|Recombination]] usually involves two parent chromosomes, combining different parts of the parents to create two new chromosomes. As long as the parent chromosomes are aligned and the exchanged fragments are homologous (that is, occupy the same position in the chromosome), the new chromosomes created by recombination will always encode syntactically correct programs.

Different kinds of crossover are easily implemented, either by changing the number of parents involved (there is no reason for choosing only two), the number of split points, or the way one chooses to exchange the fragments, for example, either randomly or in some orderly fashion. For instance, gene recombination, a special case of recombination, can be done by exchanging homologous genes (genes that occupy the same position in the chromosome) or by exchanging genes chosen at random from any position in the chromosome.

====Transposition====
[[Transposition (genetics)|Transposition]] involves the introduction of an insertion sequence somewhere in a chromosome. In gene expression programming insertion sequences may originate anywhere in the chromosome, but they are only inserted in the heads of genes. This method guarantees that even insertion sequences taken from the tails result in error-free programs.

For transposition to work properly, it must preserve chromosome length and gene structure. So, in gene expression programming transposition can be implemented using two different methods: the first creates a shift at the insertion site, followed by a deletion at the end of the head; the second overwrites the local sequence at the target site and is therefore easier to implement. Both methods can be implemented to operate between chromosomes, within a chromosome, or even within a single gene.

====Inversion====
Inversion is an interesting operator, especially powerful for [[combinatorial optimization]].<ref>{{cite web|last=Ferreira|first=C.|year=2002|title=Combinatorial Optimization by Gene Expression Programming: Inversion Revisited|url=http://www.gene-expression-programming.com/webpapers/Ferreira-ASAI02.pdf|publisher=In J. M. Santos and A. Zapico, eds., Proceedings of the Argentine Symposium on Artificial Intelligence, pages 160–174, Santa Fe, Argentina}}</ref> It consists of inverting a small sequence within a chromosome. In gene expression programming it can be easily implemented in all gene domains and, in all cases, the offspring produced is always syntactically correct. For any gene domain, a sequence (ranging from at least two elements to as big as the domain itself) is chosen at random within that domain and then inverted.

====Other genetic operators====
Several other genetic operators exist, and in gene expression programming, with its different genes and gene domains, the possibilities are endless. For example, genetic operators such as one-point recombination, two-point recombination, gene recombination, uniform recombination, gene transposition, root transposition, domain-specific mutation, domain-specific inversion, and domain-specific transposition are easily implemented and widely used.
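To make two of the operators described above concrete, the following sketch implements one-point recombination and the overwriting variant of transposition on single-gene chromosomes represented as symbol lists with a head of length <code>HEAD</code>. Both the head length and the maximum sequence length are illustrative choices.

<syntaxhighlight lang="python">
import random

HEAD = 7   # illustrative head length

# One-point recombination: split both parents at the same random point
# and swap the fragments, producing two new chromosomes.
def one_point_recombination(parent1, parent2):
    point = random.randrange(1, len(parent1))
    return (parent1[:point] + parent2[point:],
            parent2[:point] + parent1[point:])

# Overwriting transposition: copy a short sequence from anywhere in the
# chromosome over a target site inside the head (never the root). The
# tail is untouched, so chromosome length and gene structure are preserved.
def is_transposition(chromosome, max_length=3):
    length = random.randint(1, max_length)
    source = random.randrange(len(chromosome) - length + 1)
    sequence = chromosome[source:source + length]    # the insertion sequence
    target = random.randrange(1, HEAD - length + 1)  # inside the head, not the root
    child = list(chromosome)
    child[target:target + length] = sequence         # overwrite; length unchanged
    return child
</syntaxhighlight>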