==Hilbert-Schmidt Independence Criterion Lasso based feature selection==

For high-dimensional, small-sample data (e.g., dimensionality > {{10^|5}} and number of samples < {{10^|3}}), the Hilbert-Schmidt Independence Criterion Lasso (HSIC Lasso) is useful.<ref name="HSICLasso">{{cite journal |first1=M. |last1=Yamada |first2=W. |last2=Jitkrittum |first3=L. |last3=Sigal |first4=E. P. |last4=Xing |first5=M. |last5=Sugiyama |title=High-Dimensional Feature Selection by Feature-Wise Non-Linear Lasso |journal=Neural Computation |volume=26 |issue=1 |pages=185–207 |year=2014 |doi=10.1162/NECO_a_00537 |pmid=24102126 |arxiv=1202.0515 |s2cid=2742785 }}</ref> The HSIC Lasso optimization problem is given as

:<math>
\mathrm{HSIC_{Lasso}}: \min_{\mathbf{x}} \frac{1}{2}\sum_{k,l = 1}^n x_k x_l {\mbox{HSIC}}(f_k,f_l) - \sum_{k = 1}^n x_k {\mbox{HSIC}}(f_k,c) +  \lambda \|\mathbf{x}\|_1, \quad \mbox{s.t.} \ x_1,\ldots, x_n \geq 0,
</math>

where <math>{\mbox{HSIC}}(f_k,c) =\mbox{tr}(\bar{\mathbf{K}}^{(k)}  \bar{\mathbf{L}})</math> is a kernel-based independence measure called the (empirical) Hilbert-Schmidt independence criterion (HSIC) between the {{mvar|k}}-th feature <math>f_k</math> and the output <math>c</math>, {{mvar|n}} is the number of features, <math>\mbox{tr}(\cdot)</math> denotes the [[Trace (linear algebra)|trace]], <math>\lambda</math> is the regularization parameter, and <math>\|\cdot\|_{1}</math> is the <math>\ell_1</math>-norm. The matrices <math>\bar{\mathbf{K}}^{(k)} = \mathbf{\Gamma} \mathbf{K}^{(k)} \mathbf{\Gamma}</math> and <math>\bar{\mathbf{L}} = \mathbf{\Gamma} \mathbf{L} \mathbf{\Gamma}</math> are the centered input and output [[Gram matrix|Gram matrices]], where <math>K^{(k)}_{i,j} = K(u_{k,i},u_{k,j})</math> and <math>L_{i,j} = L(c_i,c_j)</math> are the Gram matrices, <math>K(u,u')</math> and <math>L(c,c')</math> are kernel functions, and <math>\mathbf{\Gamma} = \mathbf{I}_m - \frac{1}{m}\mathbf{1}_m \mathbf{1}_m^T</math> is the centering matrix. Here <math>\mathbf{I}_m</math> is the {{mvar|m}} × {{mvar|m}} [[identity matrix]] ({{mvar|m}}: the number of samples) and <math>\mathbf{1}_m</math> is the {{mvar|m}}-dimensional vector of all ones. HSIC always takes a non-negative value and, when a universal reproducing kernel such as the Gaussian kernel is used, it is zero if and only if the two random variables are statistically independent.
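
The empirical HSIC defined above can be illustrated with a minimal NumPy sketch. The function names (<code>gaussian_gram</code>, <code>centered</code>, <code>hsic</code>), the use of a Gaussian kernel for both input and output, and the fixed bandwidth <code>sigma</code> are assumptions made for illustration and are not taken from the cited paper.

```python
import numpy as np

def gaussian_gram(v, sigma=1.0):
    """Gram matrix G[i, j] = exp(-(v_i - v_j)^2 / (2 sigma^2)) for a 1-D sample vector."""
    v = np.asarray(v, dtype=float).reshape(-1, 1)
    sq_dist = (v - v.T) ** 2              # pairwise squared differences, shape (m, m)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def centered(gram):
    """Apply the centering matrix Gamma = I_m - (1/m) 1 1^T on both sides."""
    m = gram.shape[0]
    gamma = np.eye(m) - np.ones((m, m)) / m
    return gamma @ gram @ gamma

def hsic(u, c, sigma=1.0):
    """Empirical HSIC tr(K_bar L_bar), following the formula in the text (no extra scaling)."""
    k_bar = centered(gaussian_gram(u, sigma))
    l_bar = centered(gaussian_gram(c, sigma))
    return np.trace(k_bar @ l_bar)
```

With this sketch, a constant feature yields an HSIC of zero against any output, because its Gram matrix is all ones and vanishes after centering, while <code>hsic(u, u)</code> is strictly positive for any non-constant <code>u</code>.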

Up to an additive constant that does not depend on <math>\mathbf{x}</math>, the HSIC Lasso can equivalently be written as

:<math>
\mathrm{HSIC_{Lasso}}: \min_{\mathbf{x}} \frac{1}{2}\left\|\bar{\mathbf{L}} - \sum_{k = 1}^{n} x_k \bar{\mathbf{K}}^{(k)} \right\|^2_{F}  +  \lambda \|\mathbf{x}\|_1, \quad \mbox{s.t.} \ x_1,\ldots,x_n \geq 0,
</math>

where <math>\|\cdot\|_{F}</math> is the [[Frobenius norm]]. The optimization problem is a Lasso problem with a non-negativity constraint, so it can be solved efficiently with a state-of-the-art Lasso solver such as the dual [[augmented Lagrangian method]].
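
As a rough illustration of the problem above, the following NumPy sketch minimizes the objective with plain projected proximal gradient descent. This is a deliberately simple substitute for the dual augmented Lagrangian solver mentioned in the text, not the method of the cited paper. The function names, the Gaussian kernel with fixed bandwidth <code>sigma</code>, and the iteration count are illustrative assumptions.

```python
import numpy as np

def centered_gaussian_gram(v, sigma=1.0):
    """Centered Gaussian Gram matrix Gamma G Gamma for a 1-D sample vector."""
    v = np.asarray(v, dtype=float).reshape(-1, 1)
    gram = np.exp(-((v - v.T) ** 2) / (2.0 * sigma ** 2))
    m = gram.shape[0]
    gamma = np.eye(m) - np.ones((m, m)) / m
    return gamma @ gram @ gamma

def hsic_lasso(U, c, lam, sigma=1.0, n_iter=2000):
    """Non-negative HSIC Lasso sketch.

    U   : array of shape (n_features, m_samples), one row per feature
    c   : output vector of length m_samples
    lam : regularization parameter lambda
    Returns the non-negative coefficient vector x of length n_features.
    """
    # Flatten each centered Gram matrix so that inner products equal traces.
    K = np.stack([centered_gaussian_gram(row, sigma).ravel() for row in U])
    L = centered_gaussian_gram(c, sigma).ravel()
    A = K @ K.T          # A[k, l] = tr(K_bar^(k) K_bar^(l)) = HSIC(f_k, f_l)
    b = K @ L            # b[k]    = tr(K_bar^(k) L_bar)     = HSIC(f_k, c)
    # Step size 1 / (largest eigenvalue of A), the Lipschitz constant of the gradient.
    step = 1.0 / max(np.linalg.eigvalsh(A)[-1], 1e-12)
    x = np.zeros(U.shape[0])
    for _ in range(n_iter):
        grad = A @ x - b
        # For x >= 0 the l1 penalty is lam * sum(x), so the proximal step
        # is a shifted gradient step followed by projection onto x >= 0.
        x = np.maximum(0.0, x - step * (grad + lam))
    return x
```

Features whose coefficient <math>x_k</math> is non-zero in the returned vector are the selected ones; a larger <code>lam</code> generally drives more coefficients to exactly zero.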