Feature selection
In machine learning, feature selection is the process of selecting a subset of relevant features (variables, predictors) for use in model construction. Feature selection techniques are used for several reasons:
- simplification of models to make them easier to interpret,<ref name="islr">Template:Cite book</ref>
- shorter training times,<ref>Template:Citation</ref>
- avoidance of the curse of dimensionality,<ref>Template:Cite journal</ref>
- improved compatibility of the data with a particular class of learning models,<ref>Template:Cite journal</ref>
- encoding of inherent symmetries present in the input space.<ref>Template:Cite book</ref><ref>Template:Cite book</ref><ref>Template:Cite journal</ref><ref>Template:Cite journal</ref>
The central premise when using feature selection is that data sometimes contains features that are redundant or irrelevant, and can thus be removed without incurring much loss of information.<ref name="Bermingham-prolog">Template:Cite journal</ref> Redundancy and irrelevance are two distinct notions, since one relevant feature may be redundant in the presence of another relevant feature with which it is strongly correlated.
Feature extraction creates new features from functions of the original features, whereas feature selection finds a subset of the features. Feature selection techniques are often used in domains where there are many features and comparatively few samples (data points).
Introduction
A feature selection algorithm can be seen as the combination of a search technique for proposing new feature subsets, along with an evaluation measure which scores the different feature subsets. The simplest algorithm is to test each possible subset of features, finding the one which minimizes the error rate. This is an exhaustive search of the space, and is computationally intractable for all but the smallest of feature sets. The choice of evaluation metric heavily influences the algorithm, and it is these evaluation metrics which distinguish between the three main categories of feature selection algorithms: wrappers, filters and embedded methods.<ref name="guyon-intro">Template:Cite journal</ref>
- Wrapper methods use a predictive model to score feature subsets. Each new subset is used to train a model, which is tested on a hold-out set. Counting the number of mistakes made on that hold-out set (the error rate of the model) gives the score for that subset. As wrapper methods train a new model for each subset, they are very computationally intensive, but usually provide the best performing feature set for that particular type of model or typical problem.
- Filter methods use a proxy measure instead of the error rate to score a feature subset. This measure is chosen to be fast to compute, while still capturing the usefulness of the feature set. Common measures include the mutual information,<ref name="guyon-intro"/> the pointwise mutual information,<ref name="textcat"/> the Pearson product-moment correlation coefficient, Relief-based algorithms,<ref>Template:Cite journal</ref> and the inter/intra class distance or the scores of significance tests for each class/feature combination.<ref name="textcat">Template:Cite conference</ref><ref>Template:Cite journal</ref> Filters are usually less computationally intensive than wrappers, but they produce a feature set which is not tuned to a specific type of predictive model.<ref>Template:Cite journal</ref> This lack of tuning means a feature set from a filter is more general than one from a wrapper, and usually gives lower prediction performance than a wrapper. However, because the feature set does not embody the assumptions of a prediction model, it is more useful for exposing the relationships between the features. Many filters provide a feature ranking rather than an explicit best feature subset, and the cut-off point in the ranking is chosen via cross-validation. Filter methods have also been used as a preprocessing step for wrapper methods, allowing a wrapper to be used on larger problems. Another popular approach is the Recursive Feature Elimination algorithm,<ref>Template:Cite journal</ref> commonly used with Support Vector Machines to repeatedly construct a model and remove features with low weights.
- Embedded methods are a catch-all group of techniques which perform feature selection as part of the model construction process. The exemplar of this approach is the LASSO method for constructing a linear model, which penalizes the regression coefficients with an L1 penalty, shrinking many of them to zero. Any features which have non-zero regression coefficients are 'selected' by the LASSO algorithm. Improvements to the LASSO include Bolasso, which bootstraps samples;<ref name=Bolasso>Template:Cite book</ref> Elastic net regularization, which combines the L1 penalty of LASSO with the L2 penalty of ridge regression; and FeaLect, which scores all the features based on combinatorial analysis of regression coefficients.<ref name=FeaLect>Template:Cite journal</ref> AEFS further extends LASSO to the nonlinear scenario with autoencoders.<ref>Template:Cite conference</ref> These approaches tend to be between filters and wrappers in terms of computational complexity.
In traditional regression analysis, the most popular form of feature selection is stepwise regression, which is a wrapper technique. It is a greedy algorithm that adds the best feature (or deletes the worst feature) at each round. The main control issue is deciding when to stop the algorithm. In machine learning, this is typically done by cross-validation. In statistics, some criterion is optimized. This leads to the inherent problem of nesting. More robust methods have been explored, such as branch and bound and piecewise linear networks.
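As an illustration, the following is a minimal sketch of such a greedy forward-selection wrapper, assuming scikit-learn is available, that `X` is a NumPy feature matrix and `y` the target, and that a cross-validated score is used to grade each subset; the estimator, data, and stopping rule are placeholders rather than a prescribed implementation.

```python
# Minimal sketch of wrapper-style greedy forward selection (hypothetical helper).
# X: NumPy feature matrix, y: target vector, estimator: any scikit-learn estimator.
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, estimator, max_features=10, cv=5):
    selected, best_scores = [], []
    while len(selected) < max_features:
        remaining = [f for f in range(X.shape[1]) if f not in selected]
        if not remaining:
            break
        # Score every candidate subset obtained by adding one remaining feature.
        scores = {f: cross_val_score(estimator, X[:, selected + [f]], y, cv=cv).mean()
                  for f in remaining}
        best_f = max(scores, key=scores.get)
        # Stop when adding another feature no longer improves the cross-validated score.
        if best_scores and scores[best_f] <= best_scores[-1]:
            break
        selected.append(best_f)
        best_scores.append(scores[best_f])
    return selected
```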
Subset selection
Subset selection evaluates a subset of features as a group for suitability. Subset selection algorithms can be broken up into wrappers, filters, and embedded methods. Wrappers use a search algorithm to search through the space of possible features and evaluate each subset by running a model on the subset. Wrappers can be computationally expensive and have a risk of overfitting to the model. Filters are similar to wrappers in the search approach, but instead of evaluating against a model, a simpler filter is evaluated. Embedded techniques are embedded in, and specific to, a model.
Many popular search approaches use greedy hill climbing, which iteratively evaluates a candidate subset of features, then modifies the subset and evaluates whether the new subset is an improvement over the old. Evaluation of the subsets requires a scoring metric that grades a subset of features. Exhaustive search is generally impractical, so at some implementer-defined (or operator-defined) stopping point, the subset of features with the highest score discovered up to that point is selected as the satisfactory feature subset. The stopping criterion varies by algorithm; possible criteria include: a subset score exceeds a threshold, a program's maximum allowed run time has been surpassed, etc.
Alternative search-based techniques are based on targeted projection pursuit which finds low-dimensional projections of the data that score highly: the features that have the largest projections in the lower-dimensional space are then selected.
Search approaches include:
- Exhaustive<ref>Template:Cite arXiv</ref>
- Best first
- Simulated annealing
- Genetic algorithm<ref>Template:Cite journal</ref>
- Greedy forward selection<ref>Template:Cite journal</ref><ref>Template:Cite conference</ref><ref>Template:Cite journal</ref>
- Greedy backward elimination
- Particle swarm optimization<ref name="sciencedirect.com">Template:Cite journal</ref>
- Targeted projection pursuit
- Scatter search<ref>F.C. Garcia-Lopez, M. Garcia-Torres, B. Melian, J.A. Moreno-Perez, J.M. Moreno-Vega. Solving feature subset selection problem by a Parallel Scatter Search, European Journal of Operational Research, vol. 169, no. 2, pp. 477–489, 2006.</ref><ref>Template:Cite book</ref><ref>M. Garcia-Torres. Feature selection for high-dimensional data using a multivariate search space reduction strategy based scatter search, Journal of Heuristics, vol. 1, no. 31, 2025.</ref>
- Variable neighborhood search<ref>F.C. Garcia-Lopez, M. Garcia-Torres, B. Melian, J.A. Moreno-Perez, J.M. Moreno-Vega. Solving Feature Subset Selection Problem by a Hybrid Metaheuristic. In First International Workshop on Hybrid Metaheuristics, pp. 59–68, 2004.</ref><ref>M. Garcia-Torres, F. Gomez-Vela, B. Melian, J.M. Moreno-Vega. High-dimensional feature selection via feature grouping: A Variable Neighborhood Search approach, Information Sciences, vol. 326, pp. 102-118, 2016.</ref>
Two popular filter metrics for classification problems are correlation and mutual information, although neither is a true metric or 'distance measure' in the mathematical sense, since they fail to obey the triangle inequality and thus do not compute any actual 'distance' – they should rather be regarded as 'scores'. These scores are computed between a candidate feature (or set of features) and the desired output category. There are, however, true metrics that are a simple function of the mutual information.<ref>Template:Cite journal</ref>
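As a concrete illustration, the following sketch ranks features by an estimate of their mutual information with the class using scikit-learn's `mutual_info_classif`; the top-`k` cut-off is a placeholder that would normally be chosen by cross-validation, as noted above.

```python
# Sketch of a univariate mutual-information filter: score each feature by an estimate
# of I(f_i; c) and keep the k highest-scoring features (k is a placeholder cut-off).
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def mi_filter(X, y, k=10):
    scores = mutual_info_classif(X, y)   # one score per feature
    ranking = np.argsort(scores)[::-1]   # feature indices, best first
    return ranking[:k], scores
```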
Other available filter metrics include:
- Class separability
- Error probability
- Inter-class distance
- Probabilistic distance
- Entropy
- Consistency-based feature selection
- Correlation-based feature selection
Optimality criteria
The choice of optimality criteria is difficult as there are multiple objectives in a feature selection task. Many common criteria incorporate a measure of accuracy, penalised by the number of features selected. Examples include Akaike information criterion (AIC) and Mallows's Cp, which have a penalty of 2 for each added feature. AIC is based on information theory, and is effectively derived via the maximum entropy principle.<ref>Template:Citation.</ref><ref>Template:Citation.</ref>
Other criteria are Bayesian information criterion (BIC), which uses a penalty of <math>\sqrt{\log{n}}</math> for each added feature, minimum description length (MDL) which asymptotically uses <math>\sqrt{\log{n}}</math>, Bonferroni / RIC which use <math>\sqrt{2\log{p}}</math>, maximum dependency feature selection, and a variety of new criteria that are motivated by false discovery rate (FDR), which use something close to <math>\sqrt{2\log{\frac{p}{q}}}</math>. A maximum entropy rate criterion may also be used to select the most relevant subset of features.<ref>Template:Cite journal</ref>
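As an illustration of such penalised criteria in practice, the following sketch performs exhaustive subset selection under BIC with statsmodels' OLS (whose fitted results expose `aic` and `bic`); as noted in the introduction, exhaustive search is only feasible for a small number of candidate features.

```python
# Sketch of exhaustive best-subset selection under an information criterion (BIC here).
# X is a NumPy design matrix and y the response; only feasible for small feature counts.
from itertools import combinations
import statsmodels.api as sm

def best_subset_by_bic(X, y):
    best_bic, best_subset = float("inf"), None
    for k in range(1, X.shape[1] + 1):
        for subset in combinations(range(X.shape[1]), k):
            fit = sm.OLS(y, sm.add_constant(X[:, list(subset)])).fit()
            if fit.bic < best_bic:
                best_bic, best_subset = fit.bic, subset
    return best_subset, best_bic
```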
Structure learning
Filter feature selection is a specific case of a more general paradigm called structure learning. Feature selection finds the relevant feature set for a specific target variable, whereas structure learning finds the relationships between all the variables, usually by expressing these relationships as a graph. The most common structure learning algorithms assume the data is generated by a Bayesian network, so the structure is a directed graphical model. The optimal solution to the filter feature selection problem is the Markov blanket of the target node, and in a Bayesian network, there is a unique Markov blanket for each node.<ref>Template:Cite journal</ref>
Information theory based feature selection mechanisms
Several feature selection mechanisms use mutual information for scoring the different features. They usually follow the same algorithm:
- Calculate the mutual information as a score between each feature (<math> f_{i} \in F </math>) and the target class (<math>c</math>)
- Select the feature with the largest score (e.g. <math>\underset{f_{i} \in F}\operatorname{argmax}(I(f_{i},c))</math>) and add it to the set of selected features (<math>S</math>)
- Calculate the score which might be derived from the mutual information
- Select the feature with the largest score and add it to the set of selected features (e.g. <math>\underset{f_{i} \in F}\operatorname{argmax}(I_{derived}(f_{i},c))</math>)
- Repeat 3. and 4. until a certain number of features is selected (e.g. <math>|S|=l</math>)
The simplest approach uses the mutual information as the "derived" score.<ref name="Brown">Template:Cite journal</ref>
However, there are different approaches that try to reduce the redundancy between features.
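The steps above can be summarised as a single greedy loop; the sketch below assumes that the relevant (conditional) mutual-information estimates have already been computed and are supplied through the derived-score function (a hypothetical interface, not a fixed API).

```python
# Sketch of the generic greedy loop described above: repeatedly add the feature with the
# largest derived score until the desired number of features is selected.
# derived_score(i, selected) can be any score built from mutual-information estimates;
# with score(i, S) = I(f_i; c) this reduces to plain mutual-information ranking.
def greedy_mi_selection(n_features, derived_score, n_select):
    selected = []
    remaining = list(range(n_features))
    while len(selected) < n_select and remaining:
        best = max(remaining, key=lambda i: derived_score(i, selected))
        selected.append(best)
        remaining.remove(best)
    return selected

# Simplest case (plain mutual information), with mi_with_class[i] a precomputed
# estimate of I(f_i; c) (hypothetical array name):
# selected = greedy_mi_selection(len(mi_with_class), lambda i, S: mi_with_class[i], 20)
```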
Minimum-redundancy-maximum-relevance (mRMR) feature selection
Peng et al.<ref>Template:Cite journal</ref> proposed a feature selection method that can use either mutual information, correlation, or distance/similarity scores to select features. The aim is to penalise a feature's relevancy by its redundancy in the presence of the other selected features. The relevance of a feature set <math>S</math> for the class <math>c</math> is defined by the average value of all mutual information values between the individual feature <math>f_{i}</math> and the class <math>c</math> as follows:
- <math> D(S,c) = \frac{1}{|S|}\sum_{f_{i}\in S}I(f_{i};c) </math>.
The redundancy of all features in the set <math>S</math> is the average value of all mutual information values between the feature <math>f_{i}</math> and the feature <math>f_{j}</math>:
- <math> R(S) = \frac{1}{|S|^{2}}\sum_{f_{i},f_{j}\in S}I(f_{i};f_{j})</math>
The mRMR criterion is a combination of two measures given above and is defined as follows:
- <math>\mathrm{mRMR}= \max_{S}
\left[\frac{1}{|S|}\sum_{f_{i}\in S}I(f_{i};c) - \frac{1}{|S|^{2}}\sum_{f_{i},f_{j}\in S}I(f_{i};f_{j})\right].</math>
Suppose that there are <math>n</math> full-set features. Let <math>x_{i}</math> be the set membership indicator function for feature <math>f_{i}</math>, so that <math>x_{i}=1</math> indicates presence and <math>x_{i}=0</math> indicates absence of the feature <math>f_{i}</math> in the globally optimal feature set. Let <math>c_i=I(f_i;c)</math> and <math>a_{ij}=I(f_i;f_j)</math>. The above may then be written as an optimization problem:
- <math>\mathrm{mRMR}= \max_{x\in \{0,1\}^{n}}
\left[\frac{\sum^{n}_{i=1}c_{i}x_{i}}{\sum^{n}_{i=1}x_{i}} - \frac{\sum^{n}_{i,j=1}a_{ij}x_{i}x_{j}} {(\sum^{n}_{i=1}x_{i})^{2}}\right].</math>
The mRMR algorithm is an approximation of the theoretically optimal maximum-dependency feature selection algorithm that maximizes the mutual information between the joint distribution of the selected features and the classification variable. As mRMR approximates the combinatorial estimation problem with a series of much smaller problems, each of which only involves two variables, it thus uses pairwise joint probabilities which are more robust. In certain situations the algorithm may underestimate the usefulness of features as it has no way to measure interactions between features which can increase relevancy. This can lead to poor performance<ref name="Brown" /> when the features are individually useless, but are useful when combined (a pathological case is found when the class is a parity function of the features). Overall the algorithm is more efficient (in terms of the amount of data required) than the theoretically optimal max-dependency selection, yet produces a feature set with little pairwise redundancy.
mRMR is an instance of a large class of filter methods which trade off between relevancy and redundancy in different ways.<ref name="Brown"/><ref name="docs.google">Nguyen, H., Franke, K., Petrovic, S. (2010). "Towards a Generic Feature-Selection Measure for Intrusion Detection", In Proc. International Conference on Pattern Recognition (ICPR), Istanbul, Turkey. [2]</ref>
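A minimal sketch of the incremental mRMR update is given below, assuming the relevance vector and pairwise-redundancy matrix have already been estimated (for example with the mutual-information estimators mentioned above); the array names are hypothetical.

```python
# Sketch of incremental mRMR: at each step add the feature maximizing relevance minus
# mean redundancy with the already selected features. rel[i] ≈ I(f_i; c) and
# red[i, j] ≈ I(f_i; f_j) are assumed to be precomputed NumPy arrays.
import numpy as np

def mrmr_select(rel, red, n_select):
    selected = [int(np.argmax(rel))]    # start from the single most relevant feature
    while len(selected) < min(n_select, len(rel)):
        remaining = [i for i in range(len(rel)) if i not in selected]
        scores = [rel[i] - red[i, selected].mean() for i in remaining]
        selected.append(remaining[int(np.argmax(scores))])
    return selected
```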
Quadratic programming feature selection
mRMR is a typical example of an incremental greedy strategy for feature selection: once a feature has been selected, it cannot be deselected at a later stage. While mRMR could be optimized using floating search to reduce some features, it might also be reformulated as a global quadratic programming optimization problem as follows:<ref name="QPFS">Template:Cite journal</ref>
- <math>
\mathrm{QPFS}: \min_\mathbf{x} \left\{ \alpha \mathbf{x}^T H \mathbf{x} - \mathbf{x}^T F\right\} \quad \mbox{s.t.} \ \sum_{i=1}^n x_i=1, x_i\geq 0 </math>
where <math>F_{n\times1}=[I(f_1;c),\ldots, I(f_n;c)]^T</math> is the vector of feature relevancy assuming there are <math>n</math> features in total, <math>H_{n\times n}=[I(f_i;f_j)]_{i,j=1\ldots n}</math> is the matrix of feature pairwise redundancy, and <math>\mathbf{x}_{n\times 1}</math> represents relative feature weights. QPFS is solved via quadratic programming. It has recently been shown that QPFS is biased towards features with smaller entropy,<ref name="CMI" /> due to its placement of the feature self-redundancy term <math>I(f_i;f_i)</math> on the diagonal of <math>H</math>.
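As an illustration, the quadratic program above can be handed to a generic constrained optimizer; the sketch below uses SciPy's SLSQP method purely for concreteness (a dedicated QP solver would normally be preferred), with H, F, and alpha as defined above.

```python
# Sketch of QPFS via a generic constrained optimizer. H is the pairwise-redundancy
# matrix [I(f_i; f_j)], F the relevance vector [I(f_i; c)], alpha the trade-off weight.
import numpy as np
from scipy.optimize import minimize

def qpfs_weights(H, F, alpha=0.5):
    n = len(F)
    objective = lambda x: alpha * x @ H @ x - x @ F
    result = minimize(objective, np.full(n, 1.0 / n), method="SLSQP",
                      bounds=[(0.0, None)] * n,
                      constraints=[{"type": "eq", "fun": lambda x: x.sum() - 1.0}])
    return result.x  # larger weights indicate more useful features
```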
Conditional mutual information
Another score derived for the mutual information is based on the conditional relevancy:<ref name="CMI">Nguyen X. Vinh, Jeffrey Chan, Simone Romano and James Bailey, "Effective Global Approaches for Mutual Information based Feature Selection". Proceedings of the 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD'14), August 24–27, New York City, 2014. "[3]"</ref>
- <math>
\mathrm{SPEC_{CMI}}: \max_{\mathbf{x}} \left\{\mathbf{x}^T Q \mathbf{x}\right\} \quad \mbox{s.t.}\ \|\mathbf{x}\|=1, x_i\geq 0 </math>
where <math>Q_{ii}=I(f_i;c)</math> and <math>Q_{ij}=(I(f_i;c|f_j)+I(f_j;c|f_i))/2, i\ne j</math>.
An advantage of <math>\mathrm{SPEC_{CMI}}</math> is that it can be solved simply by finding the dominant eigenvector of <math>Q</math>, and is thus very scalable. <math>\mathrm{SPEC_{CMI}}</math> also handles second-order feature interaction.
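A sketch of this eigenvector step is shown below, assuming the matrix Q defined above has already been estimated from (conditional) mutual-information values.

```python
# Sketch of the SPEC_CMI solution: rank features by the dominant eigenvector of the
# symmetric matrix Q, where Q[i, i] ≈ I(f_i; c) and
# Q[i, j] ≈ (I(f_i; c | f_j) + I(f_j; c | f_i)) / 2.
import numpy as np

def spec_cmi_ranking(Q):
    eigenvalues, eigenvectors = np.linalg.eigh(Q)        # Q is symmetric
    x = np.abs(eigenvectors[:, np.argmax(eigenvalues)])  # dominant eigenvector
    return np.argsort(x)[::-1]                           # feature indices, best first
```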
Joint mutual information
In a study of different scores, Brown et al.<ref name="Brown" /> recommended the joint mutual information<ref>Template:Cite journal</ref> as a good score for feature selection. The score tries to find the feature that adds the most new information to the already selected features, in order to avoid redundancy. The score is formulated as follows:
- <math>
\begin{align} JMI(f_i) &= \sum_{f_j \in S} I(f_i,f_j;c) = \sum_{f_j \in S} \bigl(I(f_j;c) + I(f_i;c|f_j)\bigr) \\
&= \sum_{f_j \in S} \bigl[ I (f_j;c) + I (f_i;c) - \bigl(I (f_i;f_j) - I (f_i;f_j|c)\bigr)\bigr]
\end{align} </math>
The score uses the conditional mutual information and the mutual information to estimate the redundancy between the already selected features (<math> f_j \in S </math>) and the feature under investigation (<math>f_i</math>).
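For illustration, a small sketch of this score for one candidate feature is given below, assuming precomputed mutual-information and conditional mutual-information estimates (the array names are hypothetical); the candidate with the largest score is added next, as in the greedy loop described earlier.

```python
# Sketch of the JMI score of a candidate feature f_i given the selected set S, following
# the first form above. mi[j] ≈ I(f_j; c) and cond_mi[i, j] ≈ I(f_i; c | f_j) are
# assumed to be precomputed estimates.
def jmi_score(i, selected, mi, cond_mi):
    return sum(mi[j] + cond_mi[i, j] for j in selected)
```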
Hilbert-Schmidt Independence Criterion Lasso based feature selection
For high-dimensional and small-sample data (e.g., when the dimensionality is very large and the number of samples comparatively small), the Hilbert-Schmidt Independence Criterion Lasso (HSIC Lasso) is useful.<ref name="HSICLasso">Template:Cite journal</ref> The HSIC Lasso optimization problem is given as
- <math>
\mathrm{HSIC_{Lasso}}: \min_{\mathbf{x}} \frac{1}{2}\sum_{k,l = 1}^n x_k x_l {\mbox{HSIC}}(f_k,f_l) - \sum_{k = 1}^n x_k {\mbox{HSIC}}(f_k,c) + \lambda \|\mathbf{x}\|_1, \quad \mbox{s.t.} \ x_1,\ldots, x_n \geq 0, </math>
where <math>{\mbox{HSIC}}(f_k,c) =\mbox{tr}(\bar{\mathbf{K}}^{(k)} \bar{\mathbf{L}})</math> is a kernel-based independence measure called the (empirical) Hilbert-Schmidt independence criterion (HSIC), <math>\mbox{tr}(\cdot)</math> denotes the trace, <math>\lambda</math> is the regularization parameter, <math>\bar{\mathbf{K}}^{(k)} = \mathbf{\Gamma} \mathbf{K}^{(k)} \mathbf{\Gamma}</math> and <math>\bar{\mathbf{L}} = \mathbf{\Gamma} \mathbf{L} \mathbf{\Gamma}</math> are input and output centered Gram matrices, <math>K^{(k)}_{i,j} = K(u_{k,i},u_{k,j})</math> and <math>L_{i,j} = L(c_i,c_j)</math> are Gram matrices, <math>K(u,u')</math> and <math>L(c,c')</math> are kernel functions, <math>\mathbf{\Gamma} = \mathbf{I}_m - \frac{1}{m}\mathbf{1}_m \mathbf{1}_m^T</math> is the centering matrix, <math>\mathbf{I}_m</math> is the <math>m</math>-dimensional identity matrix (<math>m</math>: the number of samples), <math>\mathbf{1}_m</math> is the <math>m</math>-dimensional vector with all ones, and <math>\|\cdot\|_{1}</math> is the <math>\ell_1</math>-norm. HSIC always takes a non-negative value, and is zero if and only if two random variables are statistically independent when a universal reproducing kernel such as the Gaussian kernel is used.
The HSIC Lasso can be written as
- <math>
\mathrm{HSIC_{Lasso}}: \min_{\mathbf{x}} \frac{1}{2}\left\|\bar{\mathbf{L}} - \sum_{k = 1}^{n} x_k \bar{\mathbf{K}}^{(k)} \right\|^2_{F} + \lambda \|\mathbf{x}\|_1, \quad \mbox{s.t.} \ x_1,\ldots,x_n \geq 0, </math>
where <math>\|\cdot\|_{F}</math> is the Frobenius norm. The optimization problem is a Lasso problem, and thus it can be efficiently solved with a state-of-the-art Lasso solver such as the dual augmented Lagrangian method.
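A rough sketch of this procedure is given below: centered Gram matrices are built with a Gaussian kernel, flattened, and fed to scikit-learn's non-negative Lasso. The Gaussian kernel on the output and the regularization scaling are illustrative choices only (a delta kernel is common for classification, and scikit-learn's Lasso rescales the squared-error term).

```python
# Sketch of HSIC Lasso: build centered Gram matrices, flatten them, and solve the
# resulting non-negative Lasso. Kernel choice and regularization scaling are
# illustrative assumptions, not the reference implementation.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics.pairwise import rbf_kernel

def center(K):
    m = K.shape[0]
    gamma_mat = np.eye(m) - np.ones((m, m)) / m   # centering matrix Gamma
    return gamma_mat @ K @ gamma_mat

def hsic_lasso(X, y, lam=1e-3, kernel_width=1.0):
    m, n = X.shape
    L_bar = center(rbf_kernel(y.reshape(-1, 1), gamma=kernel_width)).ravel()
    K_bars = np.column_stack(
        [center(rbf_kernel(X[:, [k]], gamma=kernel_width)).ravel() for k in range(n)]
    )
    model = Lasso(alpha=lam, positive=True, fit_intercept=False).fit(K_bars, L_bar)
    return np.flatnonzero(model.coef_)            # indices of selected features
```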
Correlation feature selection
The correlation feature selection (CFS) measure evaluates subsets of features on the basis of the following hypothesis: "Good feature subsets contain features highly correlated with the classification, yet uncorrelated to each other".<ref>Template:Cite thesis</ref><ref>Template:Cite book</ref> The following equation gives the merit of a feature subset S consisting of k features:
- <math> \mathrm{Merit}_{S_{k}} = \frac{k\overline{r_{cf}}}{\sqrt{k+k(k-1)\overline{r_{ff}}}}.</math>
Here, <math> \overline{r_{cf}} </math> is the average value of all feature-classification correlations, and <math> \overline{r_{ff}} </math> is the average value of all feature-feature correlations. The CFS criterion is defined as follows:
- <math>\mathrm{CFS} = \max_{S_k}
\left[\frac{r_{c f_1}+r_{c f_2}+\cdots+r_{c f_k}} {\sqrt{k+2(r_{f_1 f_2}+\cdots+r_{f_i f_j}+ \cdots + r_{f_k f_{k-1} })}}\right].</math>
The <math>r_{cf_{i}}</math> and <math>r_{f_{i}f_{j}}</math> variables are referred to as correlations, but are not necessarily Pearson's correlation coefficient or Spearman's ρ. Hall's dissertation uses neither of these, but uses three different measures of relatedness, minimum description length (MDL), symmetrical uncertainty, and relief.
Let <math>x_i</math> be the set membership indicator function for feature <math>f_i</math>, and let <math>a_i = r_{c f_i}</math> and <math>b_{ij} = r_{f_i f_j}</math>; then the above can be rewritten as an optimization problem:
- <math>\mathrm{CFS} = \max_{x\in \{0,1\}^{n}}
\left[\frac{(\sum^{n}_{i=1}a_{i}x_{i})^{2}} {\sum^{n}_{i=1}x_i + \sum_{i\neq j} 2b_{ij} x_i x_j }\right].</math>
The combinatorial problems above are, in fact, mixed 0–1 linear programming problems that can be solved by using branch-and-bound algorithms.<ref>Template:Cite journal</ref>
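For illustration, a sketch of the merit computation is given below, using the absolute Pearson correlation as the 'correlation' purely for concreteness; as noted above, Hall's work uses other relatedness measures such as symmetrical uncertainty, MDL, or relief.

```python
# Sketch of the CFS merit of a candidate feature subset, with absolute Pearson
# correlation standing in for the generic "correlation" of the formula above.
import numpy as np

def cfs_merit(X, y, subset):
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, i], y)[0, 1]) for i in subset])
    r_ff = (np.mean([abs(np.corrcoef(X[:, i], X[:, j])[0, 1])
                     for i in subset for j in subset if i < j]) if k > 1 else 0.0)
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)
```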
Regularized trees
The features from a decision tree or a tree ensemble have been shown to be redundant. A recent method called regularized tree<ref name="DengRunger2012">H. Deng, G. Runger, "Feature Selection via Regularized Trees", Proceedings of the 2012 International Joint Conference on Neural Networks (IJCNN), IEEE, 2012</ref> can be used for feature subset selection. Regularized trees penalize using a variable similar to the variables selected at previous tree nodes for splitting the current node. Regularized trees need to build only one tree model (or one tree ensemble model) and thus are computationally efficient.
Regularized trees naturally handle numerical and categorical features, interactions and nonlinearities. They are invariant to attribute scales (units) and insensitive to outliers, and thus, require little data preprocessing such as normalization. Regularized random forest (RRF)<ref name="RRF">RRF: Regularized Random Forest, R package on CRAN</ref> is one type of regularized trees. The guided RRF is an enhanced RRF which is guided by the importance scores from an ordinary random forest.
Overview of metaheuristic methods
A metaheuristic is a general description of an algorithm dedicated to solving difficult (typically NP-hard) optimization problems for which there are no classical solution methods. Generally, a metaheuristic is a stochastic algorithm that tends toward a global optimum. There are many metaheuristics, from simple local search to complex global search algorithms.
Main principles
The feature selection methods are typically presented in three classes based on how they combine the selection algorithm and the model building.
Filter method
Filter type methods select variables regardless of the model. They are based only on general features like the correlation with the variable to predict. Filter methods suppress the least interesting variables. The other variables will be part of a classification or a regression model used to classify or to predict data. These methods are particularly effective in computation time and robust to overfitting.<ref name="ReferenceA">Template:Cite thesis</ref>
Filter methods tend to select redundant variables when they do not consider the relationships between variables. However, more elaborate filter methods try to minimize this problem by removing variables highly correlated with each other, such as the Fast Correlation Based Filter (FCBF) algorithm.<ref>Template:Cite journal</ref>
Wrapper method
Wrapper methods evaluate subsets of variables, which, unlike filter approaches, makes it possible to detect interactions amongst variables.<ref name="M. Phuong, Z pages 301-309">T. M. Phuong, Z. Lin et R. B. Altman. Choosing SNPs using feature selection. Template:Webarchive Proceedings / IEEE Computational Systems Bioinformatics Conference, CSB. IEEE Computational Systems Bioinformatics Conference, pages 301-309, 2005. Template:PMID.</ref> The two main disadvantages of these methods are:
- The increasing overfitting risk when the number of observations is insufficient.
- The significant computation time when the number of variables is large.
Embedded method
Embedded methods have recently been proposed that try to combine the advantages of both previous methods. A learning algorithm takes advantage of its own variable selection process and performs feature selection and classification simultaneously, as in the FRMT algorithm.<ref>Template:Cite journal</ref>
Application of feature selection metaheuristics
This is a survey of feature selection metaheuristics recently used in the literature, as compiled by J. Hammon in her 2013 thesis.<ref name="ReferenceA"/>
| Application | Algorithm | Approach | Classifier | Evaluation function | Reference |
|---|---|---|---|---|---|
| SNPs | Feature Selection using Feature Similarity | Filter | | r² | Phuong 2005<ref name="M. Phuong, Z pages 301-309"/> |
| SNPs | Genetic algorithm | Wrapper | Decision Tree | Classification accuracy (10-fold) | Shah 2004<ref>Template:Cite journal</ref> |
| SNPs | Hill climbing | Filter + Wrapper | Naive Bayesian | Predicted residual sum of squares | Long 2007<ref>Template:Cite journal</ref> |
| SNPs | Simulated annealing | | Naive Bayesian | Classification accuracy (5-fold) | Ustunkar 2011<ref>Template:Cite journal</ref> |
| Speech segments | Ant colony | Wrapper | Artificial Neural Network | MSE | Al-ani 2005 |
| Marketing | Simulated annealing | Wrapper | Regression | AIC, r² | Meiri 2006<ref>Template:Cite journal</ref> |
| Economics | Simulated annealing, genetic algorithm | Wrapper | Regression | BIC | Kapetanios 2007<ref>Template:Cite journal</ref> |
| Spectral mass | Genetic algorithm | Wrapper | Multiple Linear Regression, Partial Least Squares | Root-mean-square error of prediction | Broadhurst et al. 1997<ref>Template:Cite journal</ref> |
| Spam | Binary PSO + Mutation | Wrapper | Decision tree | Weighted cost | Zhang 2014<ref name="sciencedirect.com"/> |
| Microarray | Tabu search + PSO | Wrapper | Support Vector Machine, K Nearest Neighbors | Euclidean distance | Chuang 2009<ref>Template:Cite journal</ref> |
| Microarray | PSO + Genetic algorithm | Wrapper | Support Vector Machine | Classification accuracy (10-fold) | Alba 2007<ref>E. Alba, J. Garia-Nieto, L. Jourdan et E.-G. Talbi. Gene Selection in Cancer Classification using PSO-SVM and GA-SVM Hybrid Algorithms. Template:Webarchive Congress on Evolutionary Computation, Singapore: Singapore (2007), 2007</ref> |
| Microarray | Genetic algorithm + Iterated Local Search | Embedded | Support Vector Machine | Classification accuracy (10-fold) | Duval 2009<ref name="B. Duval, J pages 201-208">B. Duval, J.-K. Hao et J. C. Hernandez Hernandez. A memetic algorithm for gene selection and molecular classification of cancer. In Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation, GECCO '09, pages 201-208, New York, NY, USA, 2009. ACM.</ref> |
| Microarray | Iterated local search | Wrapper | Regression | Posterior probability | Hans 2007<ref>C. Hans, A. Dobra et M. West. Shotgun stochastic search for 'large p' regression. Journal of the American Statistical Association, 2007.</ref> |
| Microarray | Genetic algorithm | Wrapper | K Nearest Neighbors | Classification accuracy (leave-one-out cross-validation) | Jirapech-Umpai 2005<ref>Template:Cite journal</ref> |
| Microarray | Hybrid genetic algorithm | Wrapper | K Nearest Neighbors | Classification accuracy (leave-one-out cross-validation) | Oh 2004<ref>Template:Cite journal</ref> |
| Microarray | Genetic algorithm | Wrapper | Support Vector Machine | Sensitivity and specificity | Xuan 2011<ref>Template:Cite journal</ref> |
| Microarray | Genetic algorithm | Wrapper | All paired Support Vector Machine | Classification accuracy (leave-one-out cross-validation) | Peng 2003<ref>Template:Cite journal</ref> |
| Microarray | Genetic algorithm | Embedded | Support Vector Machine | Classification accuracy (10-fold) | Hernandez 2007<ref>Template:Cite book</ref> |
| Microarray | Genetic algorithm | Hybrid | Support Vector Machine | Classification accuracy (leave-one-out cross-validation) | Huerta 2006<ref>Template:Cite book</ref> |
| Microarray | Genetic algorithm | | Support Vector Machine | Classification accuracy (10-fold) | Muni 2006<ref>Template:Cite journal</ref> |
| Microarray | Genetic algorithm | Wrapper | Support Vector Machine | EH-DIALL, CLUMP | Jourdan 2005<ref>Template:Cite journal</ref> |
| Alzheimer's disease | Welch's t-test | Filter | Support Vector Machine | Classification accuracy (10-fold) | Zhang 2015<ref>Template:Cite journal</ref> |
| Computer vision | Infinite Feature Selection | Filter | Independent | Average precision, ROC AUC | Roffo 2015<ref>Template:Cite book</ref> |
| Microarrays | Eigenvector Centrality FS | Filter | Independent | Average precision, accuracy, ROC AUC | <ref>Template:Citation</ref> |
| XML | Symmetrical Tau (ST) | Filter | Structural Associative Classification | Accuracy, coverage | Shaharanee & Hadzic 2014 |
Feature selection embedded in learning algorithms
Some learning algorithms perform feature selection as part of their overall operation. These include:
- <math>\ell_1</math>-regularization techniques, such as sparse regression, LASSO, and <math>\ell_1</math>-SVM
- Regularized trees,<ref name="DengRunger2012" /> e.g. regularized random forest implemented in the RRF package<ref name="RRF" />
- Decision tree<ref>R. Kohavi and G. John, "Wrappers for feature subset selection", Artificial intelligence 97.1-2 (1997): 273-324</ref>
- Memetic algorithm
- Random multinomial logit (RMNL)
- Auto-encoding networks with a bottleneck-layer
- Submodular feature selection<ref>Template:Cite arXiv</ref><ref>Liu et al., Submodular feature selection for high-dimensional acoustic score spaces Template:Webarchive</ref><ref>Zheng et al., Submodular Attribute Selection for Action Recognition in Video Template:Webarchive</ref>
- Local learning based feature selection.<ref>Template:Cite journal</ref> Compared with traditional methods, it does not involve any heuristic search, can easily handle multi-class problems, and works for both linear and nonlinear problems. It is also supported by a strong theoretical foundation. Numeric experiments showed that the method can achieve a close-to-optimal solution even when data contains >1M irrelevant features.
- Recommender system based on feature selection.<ref>D.H. Wang, Y.C. Liang, D. Xu, X.Y. Feng, R.C. Guan (2018), "A content-based recommender system for computer science publications", Knowledge-Based Systems, 157: 1-9</ref> Feature selection methods have been introduced into recommender system research.
See also
- Cluster analysis
- Data mining
- Dimensionality reduction
- Feature extraction
- Hyperparameter optimization
- Model selection
- Relief (feature selection)
References
Further reading
External links
- Feature Selection Package, Arizona State University (Matlab Code)
- NIPS challenge 2003 (see also NIPS)
- Naive Bayes implementation with feature selection in Visual Basic Template:Webarchive (includes executable and source code)
- Minimum-redundancy-maximum-relevance (mRMR) feature selection program
- FEAST (Open source Feature Selection algorithms in C and MATLAB)