=== Based on infomax ===

Infomax ICA<ref name="Bell-Sejnowski">Bell, A. J.; Sejnowski, T. J. (1995). "An Information-Maximization Approach to Blind Separation and Blind Deconvolution", Neural Computation, 7, 1129–1159</ref> is essentially a multivariate, parallel version of projection pursuit. Whereas projection pursuit extracts a series of signals one at a time from a set of ''M'' signal mixtures, ICA extracts ''M'' signals in parallel. This tends to make ICA more robust than projection pursuit.<ref name="ReferenceA">James V. Stone (2004). "Independent Component Analysis: A Tutorial Introduction", The MIT Press, Cambridge, Massachusetts, London, England; {{ISBN|0-262-69315-1}}</ref>

The projection pursuit method uses [[Gram-Schmidt]] orthogonalization to ensure the independence of the extracted signals, while ICA uses [[infomax]] and [[maximum likelihood]] estimation. The non-normality of the extracted signals is achieved by assigning an appropriate model, or prior, for the signals.

In short, the process of ICA based on [[infomax]] is: given a set of signal mixtures <math>\mathbf{x}</math> and a set of identical, independent model [[cumulative distribution functions]] (cdfs) <math>g</math>, we seek the unmixing matrix <math>\mathbf{W}</math> which maximizes the joint [[entropy]] of the signals <math>\mathbf{Y}=g(\mathbf{y})</math>, where <math>\mathbf{y}=\mathbf{Wx}</math> are the signals extracted by <math>\mathbf{W}</math>. Given the optimal <math>\mathbf{W}</math>, the signals <math>\mathbf{Y}</math> have maximum entropy and are therefore independent, which ensures that the extracted signals <math>\mathbf{y}=g^{-1}(\mathbf{Y})</math> are also independent. Here <math>g</math> is an invertible function, and is the signal model. Note that if the source signal model [[probability density function]] <math>p_\mathbf{s}</math> matches the [[probability density function]] of the extracted signals <math>p_{\mathbf{y}}</math>, then maximizing the joint entropy of <math>\mathbf{Y}</math> also maximizes the amount of [[mutual information]] between <math>\mathbf{x}</math> and <math>\mathbf{Y}</math>. For this reason, using entropy to extract independent signals is known as [[infomax]].

Consider the entropy of the vector variable <math>\mathbf{Y}=g(\mathbf{y})</math>, where <math>\mathbf{y}=\mathbf{Wx}</math> is the set of signals extracted by the unmixing matrix <math>\mathbf{W}</math>. For a finite set of values sampled from a distribution with pdf <math>p_{\mathbf{y}}</math>, the entropy of <math>\mathbf{Y}</math> can be estimated as:
:<math>
H(\mathbf{Y})=-\frac{1}{N}\sum_{t=1}^N \ln p_{\mathbf{Y}}(\mathbf{Y}^t)
</math>
The joint pdf <math>p_{\mathbf{Y}}</math> can be shown to be related to the joint pdf <math>p_{\mathbf{y}}</math> of the extracted signals by the multivariate form:
:<math>
p_{\mathbf{Y}}(\mathbf{Y})=\frac{p_{\mathbf{y}}(\mathbf{y})}{\left|\frac{\partial\mathbf{Y}}{\partial \mathbf{y}}\right|}
</math>
where <math>\mathbf{J}=\frac{\partial\mathbf{Y}}{\partial \mathbf{y}}</math> is the [[Jacobian matrix]].
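As a quick numerical illustration of this entropy estimate and change-of-variables relation, the following minimal sketch treats the one-dimensional case, where <math>|\mathbf{J}|=g'(y)</math> as used below. The helper name <code>entropy_Y</code> and the Gaussian width <code>sigma = 3.0</code> in the mismatched case are illustrative choices, not values from the cited sources.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Model cdf g and its derivative g' = p_s. Here g(y) = (tanh(y) + 1)/2,
# so g'(y) = (1 - tanh(y)^2)/2 is a properly normalized density.
g_prime = lambda y: 0.5 * (1.0 - np.tanh(y) ** 2)

def entropy_Y(y, p_y):
    # H(Y) = -(1/N) sum_t ln p_Y(Y^t), with p_Y(Y) = p_y(y) / g'(y).
    return -np.mean(np.log(p_y(y) / g_prime(y)))

# Matched case: y drawn from p_s itself (inverse-cdf sampling), so
# Y = g(y) is uniform on [0, 1] and H(Y) attains its maximum, 0.
y_match = np.arctanh(2.0 * rng.uniform(size=N) - 1.0)
print(entropy_Y(y_match, g_prime))      # ~ 0.0

# Mismatched case: y drawn from a wide Gaussian instead; H(Y) < 0.
sigma = 3.0
y_gauss = sigma * rng.standard_normal(N)
gauss_pdf = lambda y: np.exp(-0.5 * (y / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
print(entropy_Y(y_gauss, gauss_pdf))    # well below 0
</syntaxhighlight>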
We have <math>|\mathbf{J}|=g'(\mathbf{y})</math>, and <math>g'</math> is the pdf assumed for the source signals, <math>g'=p_\mathbf{s}</math>; therefore,
:<math>
p_{\mathbf{Y}}(\mathbf{Y})=\frac{p_{\mathbf{y}}(\mathbf{y})}{\left|\frac{\partial\mathbf{Y}}{\partial \mathbf{y}}\right|}=\frac{p_\mathbf{y}(\mathbf{y})}{p_\mathbf{s}(\mathbf{y})}
</math>
so
:<math>
H(\mathbf{Y})=-\frac{1}{N}\sum_{t=1}^N \ln\frac{p_\mathbf{y}(\mathbf{y}^t)}{p_\mathbf{s}(\mathbf{y}^t)}
</math>
We know that when <math>p_{\mathbf{y}}=p_\mathbf{s}</math>, <math>p_{\mathbf{Y}}</math> is the uniform distribution and <math>H(\mathbf{Y})</math> is maximized. Since
:<math>
p_{\mathbf{y}}(\mathbf{y})=\frac{p_\mathbf{x}(\mathbf{x})}{\left|\frac{\partial\mathbf{y}}{\partial\mathbf{x}}\right|}=\frac{p_\mathbf{x}(\mathbf{x})}{|\mathbf{W}|}
</math>
where <math>|\mathbf{W}|</math> is the absolute value of the determinant of the unmixing matrix <math>\mathbf{W}</math>, we have
:<math>
H(\mathbf{Y})=-\frac{1}{N}\sum_{t=1}^N \ln\frac{p_\mathbf{x}(\mathbf{x}^t)}{|\mathbf{W}|\,p_\mathbf{s}(\mathbf{y}^t)}
</math>
so
:<math>
H(\mathbf{Y})=\frac{1}{N}\sum_{t=1}^N \ln p_\mathbf{s}(\mathbf{y}^t)+\ln|\mathbf{W}|+H(\mathbf{x})
</math>
since <math>H(\mathbf{x})=-\frac{1}{N}\sum_{t=1}^N\ln p_\mathbf{x}(\mathbf{x}^t)</math>. Maximizing over <math>\mathbf{W}</math> does not affect <math>H(\mathbf{x})</math>, so it suffices to maximize the function
:<math>
h(\mathbf{Y})=\frac{1}{N}\sum_{t=1}^N \ln p_\mathbf{s}(\mathbf{y}^t)+\ln|\mathbf{W}|
</math>
to achieve the independence of the extracted signals.

If the ''M'' marginal pdfs of the model joint pdf <math>p_{\mathbf{s}}</math> are independent and we use the super-Gaussian model pdf commonly assumed for source signals, <math>p_{\mathbf{s}}=(1-\tanh(\mathbf{s})^2)</math>, then we have
:<math>
h(\mathbf{Y})=\frac{1}{N}\sum_{i=1}^M\sum_{t=1}^N \ln\left(1-\tanh(\mathbf{w}_i^\mathsf{T}\mathbf{x}^t)^2\right)+\ln|\mathbf{W}|
</math>
In sum, given an observed signal mixture <math>\mathbf{x}</math>, the corresponding set of extracted signals <math>\mathbf{y}</math> and the source signal model <math>p_{\mathbf{s}}=g'</math>, we can find the optimal unmixing matrix <math>\mathbf{W}</math> and make the extracted signals independent and non-Gaussian. As in the projection pursuit situation, we can use the gradient descent method to find the optimal solution of the unmixing matrix.
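One possible concrete rendering of this procedure is the sketch below, which maximizes <math>h(\mathbf{Y})</math> by plain gradient ascent using the gradient <math>\partial h/\partial\mathbf{W}=(\mathbf{W}^\mathsf{T})^{-1}-\frac{2}{N}\tanh(\mathbf{W}\mathbf{X})\mathbf{X}^\mathsf{T}</math>, which follows from the objective above. The Laplacian sources, the 2×2 mixing matrix, and the learning-rate and iteration-count settings are illustrative assumptions, not values from Bell and Sejnowski.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Two super-Gaussian (Laplacian) sources mixed by a fixed matrix A.
N = 5000
S = rng.laplace(size=(2, N))
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
X = A @ S                                   # observed mixtures x = As

# Gradient ascent on h(Y) = (1/N) sum ln(1 - tanh(Wx)^2) + ln|det W|,
# whose gradient is dh/dW = (W^T)^{-1} - (2/N) tanh(WX) X^T.
W = np.eye(2)
lr = 0.01
for _ in range(5000):
    Y = np.tanh(W @ X)
    W += lr * (np.linalg.inv(W.T) - (2.0 / N) * (Y @ X.T))

# ICA recovers sources only up to permutation and scaling, so W @ A
# should be close to a scaled permutation matrix.
print(np.round(W @ A, 2))
</syntaxhighlight>

In practice the natural-gradient variant, which multiplies this gradient on the right by <math>\mathbf{W}^\mathsf{T}\mathbf{W}</math> and thereby avoids the matrix inversion at each step, is usually preferred for speed and stability.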