==Information Theory Based Feature Selection Mechanisms==
There are different feature selection mechanisms that use [[mutual information]] for scoring the different features. They usually all follow the same algorithm:
# Calculate the [[mutual information]] between each feature (<math> f_{i} \in F </math>) and the target class ({{mvar|c}}) as the initial score
# Select the feature with the largest score (e.g. <math>\underset{f_{i} \in F}\operatorname{argmax}(I(f_{i};c))</math>) and add it to the set of selected features ({{mvar|S}})
# Calculate the score for each remaining feature, which might be derived from the [[mutual information]]
# Select the feature with the largest score and add it to the set of selected features (e.g. <math>\underset{f_{i} \in F}\operatorname{argmax}(I_{derived}(f_{i};c))</math>)
# Repeat 3. and 4. until a certain number of features is selected (e.g. <math>|S|=l</math>)

The simplest approach uses the [[mutual information]] itself as the "derived" score.<ref name="Brown">{{cite journal |last1=Brown |first1=Gavin |last2=Pocock |first2=Adam |last3=Zhao |first3=Ming-Jie |last4=Luján |first4=Mikel |date=2012 |title=Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection |url=http://dl.acm.org/citation.cfm?id=2188385.2188387 |journal=[[Journal of Machine Learning Research]] |volume=13 |pages=27–66}}[http://www.jmlr.org/papers/volume13/brown12a/brown12a.pdf]</ref> However, there are other approaches that try to reduce the redundancy between features.

===Minimum-redundancy-maximum-relevance (mRMR) feature selection===
Peng ''et al.''<ref>{{cite journal |last1=Peng |first1=H. C. |last2=Long |first2=F. |last3=Ding |first3=C. |title=Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy |journal=[[IEEE Transactions on Pattern Analysis and Machine Intelligence]] |volume=27 |issue=8 |pages=1226–1238 |year=2005 |doi=10.1109/TPAMI.2005.159 |pmid=16119262 |citeseerx=10.1.1.63.5765 |s2cid=206764015}} [http://home.penglab.com/proj/mRMR/index.htm Program]</ref> proposed a feature selection method that can use either mutual information, correlation, or distance/similarity scores to select features. The aim is to penalise a feature's relevancy by its redundancy in the presence of the other selected features. The relevance of a feature set {{mvar|S}} for the class {{mvar|c}} is defined by the average value of all mutual information values between the individual feature {{math|''f<sub>i</sub>''}} and the class {{mvar|c}} as follows:

:<math> D(S,c) = \frac{1}{|S|}\sum_{f_{i}\in S}I(f_{i};c) </math>.

The redundancy of all features in the set {{mvar|S}} is the average value of all mutual information values between the feature {{math|''f<sub>i</sub>''}} and the feature {{math|''f<sub>j</sub>''}}:

:<math> R(S) = \frac{1}{|S|^{2}}\sum_{f_{i},f_{j}\in S}I(f_{i};f_{j})</math>

The mRMR criterion is a combination of the two measures given above and is defined as follows:

:<math>\mathrm{mRMR}= \max_{S} \left[\frac{1}{|S|}\sum_{f_{i}\in S}I(f_{i};c) - \frac{1}{|S|^{2}}\sum_{f_{i},f_{j}\in S}I(f_{i};f_{j})\right].</math>
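A minimal sketch of the greedy loop described above, using a common incremental form of the mRMR score (relevance minus mean redundancy against the already selected features), might look as follows. The plug-in <code>mutual_information</code> estimator assumes discrete features, and all names here are illustrative rather than part of the original formulation:

<syntaxhighlight lang="python">
import numpy as np

def mutual_information(x, y):
    """Plug-in estimate of I(x; y) in nats for two discrete 1-D arrays."""
    xs, xi = np.unique(x, return_inverse=True)
    ys, yi = np.unique(y, return_inverse=True)
    joint = np.zeros((xs.size, ys.size))
    np.add.at(joint, (xi, yi), 1.0)           # contingency counts
    joint /= joint.sum()                      # joint probabilities
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))

def mrmr_select(X, y, n_selected):
    """Greedy mRMR: step 2 picks the most relevant feature, then steps 3-4
    repeatedly add the feature with the largest relevance-minus-redundancy score."""
    n_features = X.shape[1]
    relevance = np.array([mutual_information(X[:, i], y) for i in range(n_features)])
    selected = [int(np.argmax(relevance))]
    candidates = set(range(n_features)) - set(selected)
    while len(selected) < n_selected and candidates:
        def derived_score(i):
            redundancy = np.mean([mutual_information(X[:, i], X[:, j]) for j in selected])
            return relevance[i] - redundancy
        best = max(candidates, key=derived_score)
        selected.append(best)
        candidates.remove(best)
    return selected
</syntaxhighlight>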
Suppose that there are {{mvar|n}} full-set features. Let {{math|''x<sub>i</sub>''}} be the set membership [[indicator function]] for feature {{math|''f<sub>i</sub>''}}, so that {{math|1=''x<sub>i</sub>''=1}} indicates presence and {{math|1=''x<sub>i</sub>''=0}} indicates absence of the feature {{math|''f<sub>i</sub>''}} in the globally optimal feature set. Let <math>c_i=I(f_i;c)</math> and <math>a_{ij}=I(f_i;f_j)</math>. The above may then be written as an optimization problem:

:<math>\mathrm{mRMR}= \max_{x\in \{0,1\}^{n}} \left[\frac{\sum^{n}_{i=1}c_{i}x_{i}}{\sum^{n}_{i=1}x_{i}} - \frac{\sum^{n}_{i,j=1}a_{ij}x_{i}x_{j}} {(\sum^{n}_{i=1}x_{i})^{2}}\right].</math>

The mRMR algorithm is an approximation of the theoretically optimal maximum-dependency feature selection algorithm, which maximizes the mutual information between the joint distribution of the selected features and the classification variable. Because mRMR approximates the combinatorial estimation problem with a series of much smaller problems, each of which involves only two variables, it uses pairwise joint probabilities, which are more robust. In certain situations the algorithm may underestimate the usefulness of features, as it has no way to measure interactions between features that can increase relevancy. This can lead to poor performance<ref name="Brown" /> when the features are individually useless but useful when combined (a pathological case is found when the class is a [[parity function]] of the features). Overall the algorithm is more efficient (in terms of the amount of data required) than the theoretically optimal max-dependency selection, yet produces a feature set with little pairwise redundancy.

mRMR is an instance of a large class of filter methods which trade off between relevancy and redundancy in different ways.<ref name="Brown"/><ref name="docs.google">Nguyen, H., Franke, K., Petrovic, S. (2010). "Towards a Generic Feature-Selection Measure for Intrusion Detection", In Proc. International Conference on Pattern Recognition (ICPR), Istanbul, Turkey. [https://www.researchgate.net/publication/220928649_Towards_a_Generic_Feature-Selection_Measure_for_Intrusion_Detection?ev=prf_pub]</ref>

=== Quadratic programming feature selection ===
mRMR is a typical example of an incremental greedy strategy for feature selection: once a feature has been selected, it cannot be deselected at a later stage. While mRMR could be optimized using floating search to reduce some features, it might also be reformulated as a global [[quadratic programming]] optimization problem as follows:<ref name="QPFS">{{cite journal |first1=I. |last1=Rodriguez-Lujan |first2=R. |last2=Huerta |first3=C. |last3=Elkan |first4=C. |last4=Santa Cruz |title=Quadratic programming feature selection |journal=[[Journal of Machine Learning Research|JMLR]] |volume=11 |pages=1491–1516 |year=2010 |url=http://jmlr.csail.mit.edu/papers/volume11/rodriguez-lujan10a/rodriguez-lujan10a.pdf}}</ref>

:<math> \mathrm{QPFS}: \min_\mathbf{x} \left\{ \alpha \mathbf{x}^T H \mathbf{x} - \mathbf{x}^T F\right\} \quad \mbox{s.t.} \ \sum_{i=1}^n x_i=1, x_i\geq 0 </math>

where <math>F_{n\times1}=[I(f_1;c),\ldots, I(f_n;c)]^T</math> is the vector of feature relevancy assuming there are {{mvar|n}} features in total, <math>H_{n\times n}=[I(f_i;f_j)]_{i,j=1\ldots n}</math> is the matrix of feature pairwise redundancy, and <math>\mathbf{x}_{n\times 1}</math> represents relative feature weights. QPFS is solved via quadratic programming. It has recently been shown that QPFS is biased towards features with smaller entropy,<ref name="CMI" /> due to its placement of the feature self-redundancy term <math>I(f_i;f_i)</math> on the diagonal of {{mvar|H}}.
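A sketch of the QPFS program above, assuming the relevance vector {{mvar|F}} and redundancy matrix {{mvar|H}} have already been estimated (e.g. with a mutual information estimator such as the one in the mRMR sketch). Solving it with SciPy's general-purpose SLSQP routine and the default value of {{mvar|α}} are illustrative assumptions, not the setup prescribed in the original paper:

<syntaxhighlight lang="python">
import numpy as np
from scipy.optimize import minimize

def qpfs_weights(F, H, alpha=0.5):
    """Minimise alpha * x^T H x - x^T F subject to sum(x) = 1 and x >= 0,
    returning the relative feature weights x."""
    n = F.shape[0]

    def objective(x):
        return alpha * x @ H @ x - x @ F

    result = minimize(objective,
                      x0=np.full(n, 1.0 / n),                            # uniform starting weights
                      method="SLSQP",
                      bounds=[(0.0, None)] * n,                          # x_i >= 0
                      constraints=[{"type": "eq",
                                    "fun": lambda x: np.sum(x) - 1.0}])  # weights sum to one
    return result.x

# Features are then ranked by their weights, e.g.
# ranking = np.argsort(qpfs_weights(F, H))[::-1]
</syntaxhighlight>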
=== Conditional mutual information ===
Another score derived from the mutual information is based on the conditional relevancy:<ref name="CMI">Nguyen X. Vinh, Jeffrey Chan, Simone Romano and James Bailey, "Effective Global Approaches for Mutual Information based Feature Selection". Proceedings of the 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD'14), August 24–27, New York City, 2014. [http://people.eng.unimelb.edu.au/baileyj/papers/frp0038-Vinh.pdf]</ref>

:<math> \mathrm{SPEC_{CMI}}: \max_{\mathbf{x}} \left\{\mathbf{x}^T Q \mathbf{x}\right\} \quad \mbox{s.t.}\ \|\mathbf{x}\|=1, x_i\geq 0 </math>

where <math>Q_{ii}=I(f_i;c)</math> and <math>Q_{ij}=(I(f_i;c|f_j)+I(f_j;c|f_i))/2, i\ne j</math>.

An advantage of {{math|SPEC<sub>CMI</sub>}} is that it can be solved simply by finding the dominant eigenvector of {{mvar|Q}} and is therefore very scalable. {{math|SPEC<sub>CMI</sub>}} also handles second-order feature interaction.

=== Joint mutual information ===
In a study of different scores, Brown et al.<ref name="Brown" /> recommended the [[joint mutual information]]<ref>{{cite journal |last1=Yang |first1=Howard Hua |last2=Moody |first2=John |title=Data visualization and feature selection: New algorithms for nongaussian data |journal=Advances in Neural Information Processing Systems |date=2000 |pages=687–693 |url=https://papers.nips.cc/paper/1779-data-visualization-and-feature-selection-new-algorithms-for-nongaussian-data.pdf}}</ref> as a good score for feature selection. The score tries to find the feature that adds the most new information to the already selected features, in order to avoid redundancy. The score is formulated as follows:

:<math>
\begin{align}
JMI(f_i) &= \sum_{f_j \in S} I(f_i,f_j;c) = \sum_{f_j \in S} \bigl(I(f_i;c) + I(f_j;c|f_i)\bigr) \\
         &= \sum_{f_j \in S} \bigl[ I(f_j;c) + I(f_i;c) - \bigl(I(f_i;f_j) - I(f_i;f_j|c)\bigr)\bigr]
\end{align}
</math>

The score uses the [[conditional mutual information]] and the [[mutual information]] to estimate the redundancy between the already selected features (<math> f_j \in S </math>) and the feature under investigation (<math>f_i</math>).
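A corresponding sketch of greedy forward selection with the JMI score, again under the discrete-feature assumption used above; the pair-encoding trick and the helper names are illustrative choices, not part of the original description:

<syntaxhighlight lang="python">
import numpy as np

def mutual_information(x, y):
    """Plug-in estimate of I(x; y) in nats for two discrete 1-D arrays."""
    _, xi = np.unique(x, return_inverse=True)
    _, yi = np.unique(y, return_inverse=True)
    joint = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(joint, (xi, yi), 1.0)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))

def joint_mutual_information(a, b, y):
    """I(a, b; y): mutual information between the joint variable (a, b) and y."""
    ai = np.unique(a, return_inverse=True)[1]
    bi = np.unique(b, return_inverse=True)[1]
    return mutual_information(ai * (bi.max() + 1) + bi, y)  # encode the pair as one variable

def jmi_select(X, y, n_selected):
    """Greedy forward selection with the JMI score: sum over f_j in S of I(f_i, f_j; c)."""
    n_features = X.shape[1]
    relevance = [mutual_information(X[:, i], y) for i in range(n_features)]
    selected = [int(np.argmax(relevance))]               # start with the most relevant feature
    candidates = set(range(n_features)) - set(selected)
    while len(selected) < n_selected and candidates:
        scores = {i: sum(joint_mutual_information(X[:, i], X[:, j], y) for j in selected)
                  for i in candidates}
        best = max(scores, key=scores.get)
        selected.append(best)
        candidates.remove(best)
    return selected
</syntaxhighlight>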