==Subset selection==
Subset selection evaluates a subset of features as a group for suitability. Subset selection algorithms can be broken down into wrappers, filters, and embedded methods. Wrappers use a [[search algorithm]] to search through the space of possible features, evaluating each subset by running a model on it. Wrappers can be computationally expensive and risk overfitting to the model. Filters are similar to wrappers in the search approach, but instead of evaluating against a model, a simpler filter is evaluated. Embedded techniques are embedded in, and specific to, a model.

Many popular search approaches use [[greedy algorithm|greedy]] [[hill climbing]], which iteratively evaluates a candidate subset of features, then modifies the subset and evaluates whether the new subset is an improvement over the old. Evaluating subsets requires a scoring [[Metric (mathematics)|metric]] that grades a subset of features. Exhaustive search is generally impractical, so at some implementor-defined (or operator-defined) stopping point, the subset of features with the highest score discovered up to that point is selected as the satisfactory feature subset. The stopping criterion varies by algorithm; possible criteria include a subset score exceeding a threshold or a program's maximum allowed run time being surpassed.

Alternative search-based techniques are based on [[targeted projection pursuit]], which finds low-dimensional projections of the data that score highly: the features that have the largest projections in the lower-dimensional space are then selected.

Search approaches include the following (a sketch of greedy forward selection appears after the list):
* Exhaustive<ref>{{cite arXiv |last1=Hazimeh |first1=Hussein |last2=Mazumder |first2=Rahul |last3=Saab |first3=Ali |eprint=2004.06152 |title=Sparse Regression at Scale: Branch-and-Bound rooted in First-Order Optimization |class=stat.CO |date=2020}}</ref>
* [[Best-first search|Best first]]
* [[Simulated annealing]]
* [[Genetic algorithm]]<ref>{{Cite journal|last1=Soufan|first1=Othman|last2=Kleftogiannis|first2=Dimitrios|last3=Kalnis|first3=Panos|last4=Bajic|first4=Vladimir B.|date=2015-02-26|title=DWFS: A Wrapper Feature Selection Tool Based on a Parallel Genetic Algorithm|journal=PLOS ONE|language=en|volume=10|issue=2|pages=e0117988|doi=10.1371/journal.pone.0117988|pmid=25719748|pmc=4342225|issn=1932-6203|bibcode=2015PLoSO..1017988S|doi-access=free}}</ref>
* [[Greedy algorithm|Greedy]] forward selection<ref>{{cite journal|last1=Figueroa|first1=Alejandro|title=Exploring effective features for recognizing the user intent behind web queries|journal=Computers in Industry|date=2015|volume=68|pages=162–169|url=https://www.researchgate.net/publication/271911317|doi=10.1016/j.compind.2015.01.005}}</ref><ref>{{cite conference |last=Figueroa |first=Alejandro |author2=Guenter Neumann |url=https://www.researchgate.net/publication/259174469 |title=Learning to Rank Effective Paraphrases from Query Logs for Community Question Answering |conference=AAAI |year=2013}}</ref><ref>{{cite journal|last=Figueroa|first=Alejandro|author2=Guenter Neumann|title=Category-specific models for ranking effective paraphrases in community Question Answering|journal=Expert Systems with Applications|date=2014|volume=41|issue=10|pages=4730–4742|url=https://www.researchgate.net/publication/260519271|doi=10.1016/j.eswa.2014.02.004|hdl=10533/196878|hdl-access=free}}</ref>
* Greedy backward elimination
* [[Particle swarm optimization]]<ref name="sciencedirect.com">{{cite journal|last1=Zhang|first1=Y.|last2=Wang|first2=S.|last3=Phillips|first3=P.|title=Binary PSO with Mutation Operator for Feature Selection using Decision Tree applied to Spam Detection|journal=Knowledge-Based Systems|date=2014|volume=64|pages=22–31|doi=10.1016/j.knosys.2014.03.015}}</ref>
* [[Targeted projection pursuit]]
* Scatter search<ref>F.C. Garcia-Lopez, M. Garcia-Torres, B. Melian, J.A. Moreno-Perez, J.M. Moreno-Vega. [https://pdfs.semanticscholar.org/ea5d/770e97b9330032e8713b0c105b523750a7c3.pdf Solving feature subset selection problem by a Parallel Scatter Search], ''European Journal of Operational Research'', vol. 169, no. 2, pp. 477–489, 2006.</ref><ref>{{Cite book|chapter-url=https://dl.acm.org/doi/abs/10.1145/3449726.3459481|doi=10.1145/3449726.3459481|chapter=Scatter search for high-dimensional feature selection using feature grouping|title=Proceedings of the Genetic and Evolutionary Computation Conference Companion|year=2021|last1=García-Torres|first1=Miguel|last2=Gómez-Vela|first2=Francisco|last3=Divina|first3=Federico|last4=Pinto-Roa|first4=Diego P.|last5=Noguera|first5=José Luis Vázquez|last6=Román|first6=Julio C. Mello|pages=149–150|isbn=9781450383516|s2cid=235770316}}</ref><ref>M. Garcia-Torres. [https://link.springer.com/article/10.1007/s10732-025-09550-9 Feature selection for high-dimensional data using a multivariate search space reduction strategy based scatter search], ''Journal of Heuristics'', vol. 1, no. 31, 2025.</ref>
* [[Variable neighborhood search]]<ref>F.C. Garcia-Lopez, M. Garcia-Torres, B. Melian, J.A. Moreno-Perez, J.M. Moreno-Vega. [https://web.archive.org/web/20190830132140/https://pdfs.semanticscholar.org/9428/2985d2c2ea4eb9f49846bedc12003a47db49.pdf Solving Feature Subset Selection Problem by a Hybrid Metaheuristic]. In ''First International Workshop on Hybrid Metaheuristics'', pp. 59–68, 2004.</ref><ref>M. Garcia-Torres, F. Gomez-Vela, B. Melian, J.M. Moreno-Vega. [https://www.researchgate.net/profile/Miguel_Garcia_Torres/publication/229763203_Parallel_Scatter_Search/links/5b2788a00f7e9be8bdaeb0d0/Parallel-Scatter-Search.pdf High-dimensional feature selection via feature grouping: A Variable Neighborhood Search approach], ''Information Sciences'', vol. 326, pp. 102–118, 2016.</ref>
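A minimal sketch of greedy forward selection, the wrapper style described above, might look like the following Python fragment. The choice of model ([[logistic regression]]), dataset, scoring metric (cross-validated accuracy via [[scikit-learn]]), and stopping criterion (a fixed subset size) are illustrative assumptions, not part of any particular published algorithm:

<syntaxhighlight lang="python">
# Sketch of greedy forward selection as a wrapper method.
# Assumed for illustration: scikit-learn is available, logistic
# regression is the model, and the stopping criterion is a fixed
# target subset size k.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

selected = []                        # current feature subset
remaining = list(range(X.shape[1]))  # candidate features
k = 5                                # stopping point (assumed)

while remaining and len(selected) < k:
    # Score each candidate subset by running the model on it;
    # evaluating against a model is what makes this a wrapper.
    scores = {f: cross_val_score(model, X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    best = max(scores, key=scores.get)
    selected.append(best)
    remaining.remove(best)

print("selected feature indices:", selected)
</syntaxhighlight>

Greedy backward elimination is the mirror image: it starts from the full feature set and repeatedly removes the feature whose removal hurts the score least. Recent versions of scikit-learn package both directions as <code>sklearn.feature_selection.SequentialFeatureSelector</code>.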
Two popular filter metrics for classification problems are [[correlation]] and [[mutual information]], although neither is a true [[metric (mathematics)|metric]] or 'distance measure' in the mathematical sense, since both fail to obey the [[triangle inequality]] and thus do not compute any actual 'distance'; they are better regarded as 'scores'. These scores are computed between a candidate feature (or set of features) and the desired output category. There are, however, true metrics that are a simple function of the mutual information;<ref>{{Cite journal|arxiv=q-bio/0311039|last1=Kraskov|first1=Alexander|title=Hierarchical Clustering Based on Mutual Information|last2=Stögbauer|first2=Harald|last3=Andrzejak|first3=Ralph G|last4=Grassberger|first4=Peter|year=2003|bibcode=2003q.bio....11039K}}</ref> see [[mutual information#Metric|here]]. A minimal example of mutual-information scoring as a filter appears after the list below.

Other available filter metrics include:
* Class separability
** Error probability
** Inter-class distance
** Probabilistic distance
** [[Entropy (Information theory)|Entropy]]
* Consistency-based feature selection
* Correlation-based feature selection
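The following sketch scores each candidate feature against the output category with estimated mutual information and keeps the top scorers; no model is run, which is what distinguishes a filter from a wrapper. The dataset, the scikit-learn mutual-information estimator, and the cut-off are illustrative assumptions:

<syntaxhighlight lang="python">
# Sketch of a filter method: score each feature against the class
# labels with (estimated) mutual information, keep the top scorers.
# No model is trained, unlike the wrapper approach.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)

scores = mutual_info_classif(X, y, random_state=0)  # one score per feature
top_k = 5                      # assumed cut-off; a score threshold also works
selected = np.argsort(scores)[::-1][:top_k]
print("selected feature indices:", selected.tolist())
</syntaxhighlight>

Because such scores are computed per feature (or per small feature set) without fitting the final model, filters are typically far cheaper than wrappers, at the cost of ignoring feature interactions that only the model would reveal.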