Editing Feature selection (section)

==Introduction==
A feature selection algorithm can be seen as the combination of a search technique for proposing new feature subsets, along with an evaluation measure which scores the different feature subsets. The simplest algorithm is to test each possible subset of features finding the one which minimizes the error rate. This is an exhaustive search of the space, and is computationally intractable for all but the smallest of feature sets. The choice of evaluation metric heavily influences the algorithm, and it is these evaluation metrics which distinguish between the three main categories of feature selection algorithms: wrappers, filters and embedded methods.<ref name="guyon-intro">{{cite journal |title=An Introduction to Variable and Feature Selection |first1=Isabelle |last1=Guyon |first2=André |last2=Elisseeff |journal=[[Journal of Machine Learning Research|JMLR]] |volume=3 |year=2003 |url=http://jmlr.csail.mit.edu/papers/v3/guyon03a.html}}</ref>

* Wrapper methods use a predictive model to score feature subsets. Each new subset is used to train a model, which is tested on a hold-out set. Counting the number of mistakes made on that hold-out set (the error rate of the model) gives the score for that subset. As wrapper methods train a new model for each subset, they are very computationally intensive, but usually provide the best performing feature set for that particular type of model or typical problem.
* Filter methods use a proxy measure instead of the error rate to score a feature subset. This measure is chosen to be fast to compute, while still capturing the usefulness of the feature set. Common measures include the [[mutual information]],<ref name="guyon-intro"/> the [[pointwise mutual information]],<ref name="textcat"/> [[Pearson product-moment correlation coefficient]], [[Relief (feature selection)|Relief-based algorithms]],<ref>{{Cite journal|last1=Urbanowicz|first1=Ryan J.|last2=Meeker|first2=Melissa|last3=LaCava|first3=William|last4=Olson|first4=Randal S.|last5=Moore|first5=Jason H.|title=Relief-Based Feature Selection: Introduction and Review|journal=Journal of Biomedical Informatics|volume=85|pages=189–203|arxiv=1711.08421|pmid=30031057|pmc=6299836|year=2018|doi=10.1016/j.jbi.2018.07.014}}</ref> and inter/intra class distance or the scores of [[Statistical hypothesis testing|significance tests]] for each class/feature combinations.<ref name="textcat">{{cite conference |last1=Yang |first1=Yiming |first2=Jan O. |last2=Pedersen |title=A comparative study on feature selection in text categorization |conference=ICML |year=1997|url=http://www.surdeanu.info/mihai/teaching/ista555-spring15/readings/yang97comparative.pdf}}</ref><ref>{{cite journal |last1=Forman |first1=George |title=An extensive empirical study of feature selection metrics for text classification |journal=Journal of Machine Learning Research |volume=3 |year=2003 |pages=1289–1305|url=http://www.jmlr.org/papers/volume3/forman03a/forman03a.pdf}}</ref> Filters are usually less computationally intensive than wrappers, but they produce a feature set which is not tuned to a specific type of predictive model.<ref>{{cite journal|author1=Yishi Zhang|author2=Shujuan Li|author3=Teng Wang|author4=Zigang Zhang|title=Divergence-based feature selection for separate classes|journal=Neurocomputing|date=2013|volume=101|issue=4|pages=32–42|doi=10.1016/j.neucom.2012.06.036}}</ref> This lack of tuning means a feature set from a filter is more general than the set from a wrapper, usually giving lower prediction performance than a wrapper. However the feature set doesn't contain the assumptions of a prediction model, and so is more useful for exposing the relationships between the features. Many filters provide a feature ranking rather than an explicit best feature subset, and the cut off point in the ranking is chosen via [[Cross-validation (statistics)|cross-validation]]. Filter methods have also been used as a preprocessing step for wrapper methods, allowing a wrapper to be used on larger problems. One other popular approach is the Recursive Feature Elimination algorithm,<ref>{{cite journal|author1=Guyon I.|author2=Weston J.|author3=Barnhill S.|author4=Vapnik V.|title=Gene selection for cancer classification using support vector machines|journal=Machine Learning|date=2002|volume=46|issue=1–3|pages=389–422|doi=10.1023/A:1012487302797|doi-access=free}}</ref> commonly used with [[Support Vector Machines]] to repeatedly construct a model and remove features with low weights. 
* Embedded methods are a catch-all group of techniques which perform feature selection as part of the model construction process. The exemplar of this approach is the [[Lasso (statistics)|LASSO]] method for constructing a linear model, which penalizes the regression coefficients with an L1 penalty, shrinking many of them to zero. Any features which have non-zero regression coefficients are 'selected' by the LASSO algorithm. Improvements to the LASSO include Bolasso which bootstraps samples;<ref name=Bolasso>{{Cite book|last1=Bach|first1=Francis R|title=Proceedings of the 25th international conference on Machine learning - ICML '08 |chapter=Bolasso |date=2008|pages=33–40|doi=10.1145/1390156.1390161|isbn=9781605582054|s2cid=609778}}</ref> [[Elastic net regularization]], which combines the L1 penalty of LASSO with the L2 penalty of [[ridge regression]]; and FeaLect which scores all the features based on combinatorial analysis of regression coefficients.<ref name=FeaLect>{{cite journal|last1=Zare|first1=Habil|title=Scoring relevancy of features based on combinatorial analysis of Lasso with application to lymphoma diagnosis|journal=BMC Genomics|date=2013|volume=14|issue=Suppl 1 |pages=S14|doi=10.1186/1471-2164-14-S1-S14|pmid=23369194|pmc=3549810 |doi-access=free }}</ref> AEFS further extends LASSO to nonlinear scenario with autoencoders.<ref>{{cite conference |author1=Kai Han|author2=Yunhe Wang|author3=Chao Zhang|author4=Chao Li|author5=Chao Xu|title=Autoencoder inspired unsupervised feature selection |conference=IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) |year=2018}}</ref> These approaches tend to be between filters and wrappers in terms of computational complexity.

In traditional [[regression analysis]], the most popular form of feature selection is [[stepwise regression]], which is a wrapper technique.  It is a [[greedy algorithm]] that adds the best feature (or deletes the worst feature) at each round.  The main control issue is deciding when to stop the algorithm.  In machine learning, this is typically done by [[Cross-validation (statistics)|cross-validation]].  In statistics, some criteria are optimized.  This leads to the inherent problem of nesting. More robust methods have been explored, such as [[branch and bound]] and piecewise linear network.