=== Projection pursuit ===
Signal mixtures tend to have Gaussian probability density functions, and source signals tend to have non-Gaussian probability density functions. Each source signal can be extracted from a set of signal mixtures by taking the inner product of a weight vector and those signal mixtures, where this inner product provides an orthogonal projection of the signal mixtures. The remaining challenge is finding such a weight vector. One type of method for doing so is [[projection pursuit]].<ref name="James V. Stone 2004">James V. Stone (2004); "Independent Component Analysis: A Tutorial Introduction", The MIT Press, Cambridge, Massachusetts, London, England; {{ISBN|0-262-69315-1}}</ref><ref>Kruskal, JB. 1969; "Toward a practical method which helps uncover the structure of a set of observations by finding the line transformation which optimizes a new 'index of condensation'", Pages 427–440 of: Milton, RC, & Nelder, JA (eds), Statistical computation; New York, Academic Press</ref>

Projection pursuit seeks one projection at a time such that the extracted signal is as non-Gaussian as possible. This contrasts with ICA, which typically extracts ''M'' signals simultaneously from ''M'' signal mixtures and therefore requires estimating an ''M'' × ''M'' unmixing matrix. One practical advantage of projection pursuit over ICA is that fewer than ''M'' signals can be extracted if required, where each source signal is extracted from the ''M'' signal mixtures using an ''M''-element weight vector.

We can use [[kurtosis]] to recover multiple source signals by finding the correct weight vectors with the use of projection pursuit. The kurtosis of the probability density function of a signal, for a finite sample, is computed as

:<math>
K=\frac{\operatorname{E}[(\mathbf{y}-\mathbf{\overline{y}})^4]}{(\operatorname{E}[(\mathbf{y}-\mathbf{\overline{y}})^2])^2}-3
</math>

where <math>\mathbf{\overline{y}}</math> is the [[sample mean]] of the extracted signal <math>\mathbf{y}</math>. The constant 3 ensures that Gaussian signals have zero kurtosis, super-Gaussian signals have positive kurtosis, and sub-Gaussian signals have negative kurtosis. The denominator is the [[variance]] of <math>\mathbf{y}</math>, and ensures that the measured kurtosis takes account of signal variance. The goal of projection pursuit is to maximize the kurtosis, making the extracted signal as non-normal as possible.

Using kurtosis as a measure of non-normality, we can now examine how the kurtosis of a signal <math>\mathbf{y} = \mathbf{w}^T \mathbf{x}</math> extracted from a set of ''M'' mixtures <math>\mathbf{x}=(x_1,x_2,\ldots,x_M)^T</math> varies as the weight vector <math>\mathbf{w}</math> is rotated around the origin. Given our assumption that each source signal <math>\mathbf{s}</math> is super-Gaussian, we would expect:
#the kurtosis of the extracted signal <math>\mathbf{y}</math> to be maximal precisely when <math>\mathbf{y} = \mathbf{s}</math>, and
#the kurtosis of the extracted signal <math>\mathbf{y}</math> to be maximal when <math>\mathbf{w}</math> is orthogonal to the projected axes <math>S_1</math> or <math>S_2</math>, because we know the optimal weight vector should be orthogonal to a transformed axis <math>S_1</math> or <math>S_2</math>.

For multiple source mixture signals, we can use kurtosis and [[Gram-Schmidt]] orthogonalization (GSO) to recover the signals. Given ''M'' signal mixtures in an ''M''-dimensional space, GSO projects these data points onto an (''M'' − 1)-dimensional space by using the weight vector. We can guarantee the independence of the extracted signals with the use of GSO.
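As an illustration of the kurtosis measure defined above, the following is a minimal sketch in Python with NumPy; the function names, the Laplacian toy data and the example weight vector are illustrative assumptions rather than part of any standard library.

<syntaxhighlight lang="python">
import numpy as np

def excess_kurtosis(y):
    """Sample excess kurtosis K = E[(y - mean)^4] / (E[(y - mean)^2])^2 - 3."""
    y = np.asarray(y, dtype=float)
    centered = y - y.mean()
    return np.mean(centered ** 4) / np.mean(centered ** 2) ** 2 - 3.0

def extract_signal(w, x):
    """Extracted signal y = w^T x for a weight vector w and mixtures x (M x N)."""
    return w @ x

# Toy example: x stands in for M = 2 signal mixtures with N = 1000 samples each.
# Laplacian samples are super-Gaussian, so the extracted signal has positive kurtosis.
rng = np.random.default_rng(0)
x = rng.laplace(size=(2, 1000))
w = np.array([0.6, 0.8])          # unit-length weight vector
print(excess_kurtosis(extract_signal(w, x)))
</syntaxhighlight>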
In order to find the correct value of <math>\mathbf{w}</math>, we can use the [[gradient descent]] method. We first whiten the data, transforming <math>\mathbf{x}</math> into a new mixture <math>\mathbf{z}=(z_1,z_2,\ldots,z_M)^T</math> which has unit variance. This can be achieved by applying [[singular value decomposition]] to <math>\mathbf{x}</math>,

:<math>\mathbf{x} = \mathbf{U} \mathbf{D} \mathbf{V}^T</math>

rescaling each vector <math>U_i=U_i/\operatorname{E}(U_i^2)</math>, and letting <math>\mathbf{z} = \mathbf{U}</math>. The signal extracted by a weight vector <math>\mathbf{w}</math> is <math>\mathbf{y} = \mathbf{w}^T \mathbf{z}</math>. If the weight vector <math>\mathbf{w}</math> has unit length, then the variance of <math>\mathbf{y}</math> is also 1, that is <math>\operatorname{E}[(\mathbf{w}^T \mathbf{z})^2]=1</math>. The kurtosis can thus be written as:

:<math>
K=\frac{\operatorname{E}[\mathbf{y}^4]}{(\operatorname{E}[\mathbf{y}^2])^2}-3=\operatorname{E}[(\mathbf{w}^T \mathbf{z})^4]-3.
</math>

The updating process for <math>\mathbf{w}</math> is:

:<math>\mathbf{w}_{new}=\mathbf{w}_{old}-\eta\operatorname{E}[\mathbf{z}(\mathbf{w}_{old}^T \mathbf{z})^3 ]</math>

where <math>\eta</math> is a small step size chosen to ensure that <math>\mathbf{w}</math> converges to the optimal solution. After each update, we normalize <math>\mathbf{w}_{new}=\frac{\mathbf{w}_{new}}{|\mathbf{w}_{new}|}</math>, set <math>\mathbf{w}_{old}=\mathbf{w}_{new}</math>, and repeat the updating process until convergence.
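A minimal sketch, assuming Python with NumPy, of the whitening step and the iterative weight update described above. The whitening here rescales the right singular vectors of the centred data so that <math>\operatorname{E}[\mathbf{z}\mathbf{z}^T]=\mathbf{I}</math>, which is one common way of obtaining unit-variance, uncorrelated mixtures; the step size, iteration count and random initialization are illustrative, and a fixed number of iterations stands in for a proper convergence test.

<syntaxhighlight lang="python">
import numpy as np

def whiten(x):
    """Whiten the mixtures x (M x N) so that the result z satisfies E[z z^T] = I.

    Uses the SVD x = U D V^T of the centred data; the rows of V^T, rescaled by
    sqrt(N), are uncorrelated and have unit variance.
    """
    x = x - x.mean(axis=1, keepdims=True)
    u, d, vt = np.linalg.svd(x, full_matrices=False)
    n_samples = x.shape[1]
    return np.sqrt(n_samples) * vt

def find_weight_vector(z, eta=0.1, n_iter=500, rng=None):
    """Iterate the update w <- w - eta * E[z (w^T z)^3], renormalizing after
    each step, starting from a random unit-length weight vector."""
    rng = np.random.default_rng(0) if rng is None else rng
    w = rng.standard_normal(z.shape[0])
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        y = w @ z                              # extracted signal y = w^T z
        grad = (z * y ** 3).mean(axis=1)       # E[z (w^T z)^3]
        w = w - eta * grad
        w /= np.linalg.norm(w)                 # keep |w| = 1
    return w

# Usage: z = whiten(x); w = find_weight_vector(z); y = w @ z
</syntaxhighlight>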
We can also use other contrast functions to update the weight vector <math>\mathbf{w}</math>. One approach uses [[negentropy]]<ref name=comon94/><ref>{{cite journal|last=Hyvärinen|first=Aapo|author2=Erkki Oja|title=Independent Component Analysis: Algorithms and Applications|journal=Neural Networks|year=2000|volume=13|issue=4–5|series=4-5|pages=411–430|doi=10.1016/s0893-6080(00)00026-5|pmid=10946390|citeseerx=10.1.1.79.7003|s2cid=11959218 }}</ref> instead of kurtosis. Negentropy is more robust than kurtosis, because kurtosis is very sensitive to outliers. The negentropy methods are based on an important property of the Gaussian distribution: a Gaussian variable has the largest entropy among all continuous random variables of equal variance; this is also why we seek the most non-Gaussian variables. A simple proof can be found in [[Differential entropy]]. Negentropy is defined as

:<math>J(x) = S(y) - S(x)\,</math>

where <math>y</math> is a Gaussian random variable with the same covariance matrix as <math>x</math>, and

:<math>S(x) = - \int p_x(u) \log p_x(u) \, du</math>

is the differential entropy. An approximation for negentropy is

:<math>J(x)=\frac{1}{12}(\operatorname{E}(x^3))^2 + \frac{1}{48}(\operatorname{kurt}(x))^2</math>

A proof can be found in the original papers of Comon;<ref name=pc91/><ref name=comon94/> it has been reproduced in the book ''Independent Component Analysis'' by Aapo Hyvärinen, Juha Karhunen, and [[Erkki Oja]].<ref>{{cite book|first1=Aapo |last1=Hyvärinen |first2=Juha |last2=Karhunen |first3=Erkki |last3=Oja|title=Independent component analysis|year=2001|publisher=Wiley|location=New York, NY |isbn=978-0-471-40540-5|edition=Reprint}}</ref> This approximation suffers from the same problem as kurtosis (sensitivity to outliers), so other approximations have been developed,<ref>{{cite journal|last=Hyvärinen|first=Aapo|title=New approximations of differential entropy for independent component analysis and projection pursuit.|journal=Advances in Neural Information Processing Systems|year=1998|volume=10|pages=273–279}}</ref> for example

:<math>J(y) = k_1(\operatorname{E}(G_1(y)))^2 + k_2(\operatorname{E}(G_2(y)) - \operatorname{E}(G_2(v)))^2</math>

where <math>v</math> is a standard Gaussian random variable and <math>k_1</math>, <math>k_2</math> are positive constants. A common choice of <math>G_1</math> and <math>G_2</math> is

:<math>G_1(u) = \frac{1}{a_1}\log(\cosh(a_1u))</math> and <math>G_2(u) = -\exp\left(-\frac{u^2}{2}\right).</math>
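A minimal sketch of this negentropy-based contrast in Python with NumPy, assuming illustrative values for the constants <math>k_1</math>, <math>k_2</math> and <math>a_1</math> (specific values are not prescribed above) and using the closed-form expectation <math>\operatorname{E}(G_2(v)) = -1/\sqrt{2}</math> for a standard Gaussian <math>v</math>.

<syntaxhighlight lang="python">
import numpy as np

# Illustrative constants: k_1, k_2 are positive weights and a_1 is a shape
# parameter; the values below are placeholders, not prescribed choices.
A1 = 1.0
K1, K2 = 1.0, 1.0

def g1(u, a1=A1):
    """G_1(u) = (1/a_1) * log(cosh(a_1 * u))."""
    return np.log(np.cosh(a1 * u)) / a1

def g2(u):
    """G_2(u) = -exp(-u^2 / 2)."""
    return -np.exp(-u ** 2 / 2.0)

def negentropy_approx(y):
    """J(y) = k_1 (E[G_1(y)])^2 + k_2 (E[G_2(y)] - E[G_2(v)])^2,
    where E[G_2(v)] = -1/sqrt(2) for a standard Gaussian v."""
    e_g2_gauss = -1.0 / np.sqrt(2.0)
    term1 = K1 * np.mean(g1(y)) ** 2
    term2 = K2 * (np.mean(g2(y)) - e_g2_gauss) ** 2
    return term1 + term2

# For a whitened extracted signal y, a larger J indicates a more non-Gaussian
# (and hence more plausibly independent) component.
</syntaxhighlight>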